An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations

Wang, Zhigang; Gu, Yu; Bao, Yubin; Yu, Ge; Gao, Lixin

doi:10.1007/s10619-017-7192-2


Published: 09 March 2017

Volume 35, pages 177–196 (2017)

Distributed and Parallel Databases


Abstract

In recent years, many large-scale iterative graph computation systems, such as Pregel, have been developed. To make these systems fault-tolerant, checkpointing, which periodically archives graph states onto distributed file systems, has been proposed. However, fault tolerance remains challenging because the whole data set is archived at a static interval, which burdens the underlying graph computations with I/O costs in terms of disk and network communication. Motivated by this, we first propose to dynamically adjust checkpoint intervals based on a carefully designed cost-analysis model that takes the underlying computing workload into account. Furthermore, for algorithms that can be restarted from any point during computation, we prioritize graph states so that checkpointing can be performed on selected data instead of the entire data set, reducing the archiving overhead while still guaranteeing the efficiency of failure recovery. Finally, we conduct extensive performance studies on a broad spectrum of real-world graphs to confirm the effectiveness of our approaches over existing up-to-date solutions.
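To make the idea of a workload-dependent checkpoint interval concrete, the following is a minimal sketch and not the cost-analysis model of the article, which is not reproduced in this preview. It assumes the classical first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures), and re-evaluates it at every superstep with a fresh estimate of the checkpoint cost. The names `checkpoint_interval` and `plan_checkpoints`, and all numbers in the usage note, are hypothetical.

```python
import math


def checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of the compute time between two checkpoints.

    Assumption: uses the classical sqrt(2 * C * MTBF) approximation,
    not the cost model proposed in the article.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def plan_checkpoints(superstep_times_s, checkpoint_costs_s, mtbf_s):
    """Return the superstep indices after which a checkpoint is taken.

    superstep_times_s[i]  : measured run time of superstep i (seconds)
    checkpoint_costs_s[i] : estimated cost of archiving the state after
                            superstep i (seconds); it may change as the
                            workload changes
    """
    chosen = []
    elapsed = 0.0  # compute time accumulated since the last checkpoint
    for step, (run_time, cost) in enumerate(
        zip(superstep_times_s, checkpoint_costs_s)
    ):
        elapsed += run_time
        # The threshold is recomputed each superstep from the current cost.
        if elapsed >= checkpoint_interval(cost, mtbf_s):
            chosen.append(step)
            elapsed = 0.0
    return chosen
```

With six supersteps of 10 s each and a mean time between failures of 100 s, a constant checkpoint cost of 2 s yields a checkpoint after every second superstep, whereas a cost that rises to 8 s from the third superstep onward stretches the interval and postpones the next checkpoint.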



Acknowledgements

This work was supported by the National Natural Science Foundation of China (61472071, 61433008, 61528203, 61272179, and 61602103) and the U.S. NSF Grant CNS-1217284. Zhigang Wang was a visiting student at UMass Amherst, supported by the China Scholarship Council, when this work was performed. The authors are also grateful to the anonymous reviewers for their constructive comments.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northeastern University, Shenyang, 110819, China
Zhigang Wang, Yu Gu, Yubin Bao & Ge Yu
Department of Electrical and Computer Engineering, University of Massachusetts Amherst, Amherst, MA, 01003, USA
Lixin Gao


Corresponding author

Correspondence to Ge Yu.


About this article

Cite this article

Wang, Z., Gu, Y., Bao, Y. et al. An I/O-efficient and adaptive fault-tolerant framework for distributed graph computations. Distrib Parallel Databases 35, 177–196 (2017). https://doi.org/10.1007/s10619-017-7192-2

Issue Date: June 2017
DOI: https://doi.org/10.1007/s10619-017-7192-2
