An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations

Distributed and Parallel Databases

Abstract

In recent years, many large-scale iterative graph computation systems, such as Pregel, have been developed. To make these systems fault-tolerant, checkpointing, which periodically archives graph states onto a distributed file system, has been proposed. However, fault tolerance remains challenging because the whole data set is archived at a static interval, which causes the underlying graph computations to incur I/O costs in both disk access and network communication. Motivated by this, we first propose to dynamically adjust checkpoint intervals using a carefully designed cost-analysis model that takes the underlying computing workload into account. Furthermore, for algorithms that can be restarted from any point during the computation, we prioritize graph states so that checkpointing can be performed on selected data instead of the entire data set, reducing archiving overhead while still guaranteeing efficient failure recovery. Finally, we conduct extensive performance studies on a broad spectrum of real-world graphs to confirm the effectiveness of our approaches over existing state-of-the-art solutions.
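
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the cost-analysis model of the article, which is not reproduced on this page. As a stand-in, it uses the classic first-order approximation of the optimum checkpoint interval (commonly attributed to Young), sqrt(2 · C · M), where C is the cost of writing one checkpoint and M is the mean time between failures. The function names, the per-superstep timings, and the priority values are all hypothetical and chosen only for illustration.

```python
import heapq
import math


def first_order_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Stand-in cost model (assumption): Young's first-order approximation
    of the optimum time between checkpoints, sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def supersteps_between_checkpoints(checkpoint_cost_s: float,
                                   mtbf_s: float,
                                   superstep_time_s: float) -> int:
    """Convert the time interval into a number of supersteps.

    Re-evaluating this as the measured superstep time changes makes the
    interval follow the workload instead of staying static.
    """
    interval_s = first_order_interval(checkpoint_cost_s, mtbf_s)
    return max(1, int(interval_s // superstep_time_s))


def select_states(priorities: dict, k: int) -> list:
    """Pick the k highest-priority vertices to archive (hypothetical
    priority values; the article's prioritization rule is not shown here)."""
    return heapq.nlargest(k, priorities, key=priorities.get)


if __name__ == "__main__":
    # Hypothetical measurements: 50 s per checkpoint, 10,000 s between failures.
    for step_time in (100.0, 300.0):
        n = supersteps_between_checkpoints(50.0, 10_000.0, step_time)
        print(f"superstep time {step_time:.0f} s -> checkpoint every {n} supersteps")

    # Archive only the two highest-priority vertex states.
    print(select_states({"a": 0.1, "b": 0.9, "c": 0.5}, 2))
```

With these hypothetical numbers, slower supersteps lead to fewer supersteps between checkpoints, and only the selected vertex states would be written instead of the entire data set.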

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61472071, 61433008, 61528203, 61272179, and 61602103) and the U.S. NSF Grant CNS-1217284. Zhigang Wang was a visiting student at UMass Amherst, supported by the China Scholarship Council, when this work was performed. The authors are also grateful to the anonymous reviewers for their constructive comments.

Author information

Corresponding author

Correspondence to Ge Yu.

About this article

Cite this article

Wang, Z., Gu, Y., Bao, Y. et al. An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations. Distrib Parallel Databases 35, 177–196 (2017). https://doi.org/10.1007/s10619-017-7192-2

  • DOI: https://doi.org/10.1007/s10619-017-7192-2