ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing

  • Conference paper
Web and Big Data (APWeb-WAIM 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13421)

Abstract

Iterative computation in distributed graph processing systems typically incurs a long runtime. Hence, it is crucial for graph processing to tolerate and quickly recover from intermittent failures. Existing solutions can be categorized into checkpoint-based and checkpoint-free solutions. The former write checkpoints periodically during execution, which leads to significant overhead. In contrast, the latter require no checkpoints: once a failure happens, they reload the input data and reset the values of the lost vertices directly. However, reloading input data involves repartitioning, which incurs additional overhead. Moreover, we observe that checkpoint-free solutions cannot effectively handle failures for graph algorithms with topological mutations. To address these issues, we propose ACF2 with a partition-aware backup strategy and an incremental protocol. In particular, the partition-aware backup strategy backs up the sub-graphs of all nodes after initial partitioning. Once a failure happens, it recovers the lost sub-graphs from the backups and then resumes computation as a checkpoint-free solution does. To effectively handle failures involving topological mutations, the incremental protocol logs topological mutations during normal execution, and these logs are exploited for recovery. We implement ACF2 based on Apache Giraph, and our experiments show that ACF2 significantly outperforms existing solutions.
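
The recovery flow described in the abstract can be illustrated with a small toy sketch. It is not the authors' Giraph implementation: the names ToyCluster, Worker, mutate, fail and recover are hypothetical, in-memory dictionaries stand in for reliable backup storage, vertices are assumed to be integers partitioned by id modulo the number of workers, and mutations are assumed to touch only out-edges of existing vertices. Under those assumptions, a lost sub-graph is rebuilt from its post-partitioning backup, the logged mutations are replayed, and the lost vertex values are reset, without reloading or repartitioning the input.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """One compute node holding a sub-graph (hypothetical structure)."""
    adjacency: dict = field(default_factory=dict)  # vertex -> set of out-neighbours
    values: dict = field(default_factory=dict)     # vertex -> current vertex value


class ToyCluster:
    """Toy model of partition-aware backup plus a topological mutation log."""

    def __init__(self, edges, num_workers, initial_value=0.0):
        self.initial_value = initial_value
        self.workers = [Worker() for _ in range(num_workers)]
        for src, dst in edges:
            self._load_edge(src, dst)
        # Back up every sub-graph once, right after initial partitioning.
        self.backups = [
            {v: set(ns) for v, ns in w.adjacency.items()} for w in self.workers
        ]
        # One mutation log per worker, appended to during normal execution.
        self.mutation_log = [[] for _ in range(num_workers)]

    def owner(self, vertex):
        # Assumed partitioning scheme: vertex id modulo worker count.
        return vertex % len(self.workers)

    def _load_edge(self, src, dst):
        for v in (src, dst):
            w = self.workers[self.owner(v)]
            w.adjacency.setdefault(v, set())
            w.values.setdefault(v, self.initial_value)
        self.workers[self.owner(src)].adjacency[src].add(dst)

    @staticmethod
    def _apply(adjacency, op, src, dst):
        if op == "add":
            adjacency.setdefault(src, set()).add(dst)
        elif op == "remove":
            adjacency.get(src, set()).discard(dst)

    def mutate(self, op, src, dst):
        """Log a topological mutation, then apply it to the live sub-graph."""
        wid = self.owner(src)
        self.mutation_log[wid].append((op, src, dst))
        self._apply(self.workers[wid].adjacency, op, src, dst)

    def fail(self, wid):
        """Simulate losing a worker: its sub-graph and values disappear."""
        self.workers[wid] = Worker()

    def recover(self, wid):
        """Restore the backup, replay logged mutations, reset lost values."""
        adjacency = {v: set(ns) for v, ns in self.backups[wid].items()}
        for op, src, dst in self.mutation_log[wid]:
            self._apply(adjacency, op, src, dst)
        values = {v: self.initial_value for v in adjacency}
        self.workers[wid] = Worker(adjacency, values)
```

For example, after building ToyCluster([(0, 1), (1, 2), (2, 3), (3, 0)], num_workers=2), applying a few mutations, and calling fail(0) followed by recover(0), worker 0 holds the mutated topology again with its vertex values reset, while worker 1 is left untouched.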

Notes

  1. http://networkrepository.com/orkut.php

  2. http://networkrepository.com/web-cc12-PayLevelDomain.php

  3. https://snap.stanford.edu/data/com-Friendster.html

Acknowledgments

This work has been supported by the National Natural Science Foundation of China (No. 61902128).

Author information

Corresponding author

Correspondence to Chen Xu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, C., Yang, Y., Pan, Q., Zhou, H. (2023). ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing. In: Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T. (eds) Web and Big Data. APWeb-WAIM 2022. Lecture Notes in Computer Science, vol 13421. Springer, Cham. https://doi.org/10.1007/978-3-031-25158-0_5

  • DOI: https://doi.org/10.1007/978-3-031-25158-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25157-3

  • Online ISBN: 978-3-031-25158-0

  • eBook Packages: Computer Science, Computer Science (R0)
