Skip to main content
Log in

SOOP: Efficient Distributed Graph Computation Supporting Second-Order Random Walks

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

The second-order random walk has recently been shown to effectively improve the accuracy in graph analysis tasks. Existing work mainly focuses on centralized second-order random walk (SOW) algorithms. SOW algorithms rely on edge-to-edge transition probabilities to generate next random steps. However, it is prohibitively costly to store all the probabilities for large-scale graphs, and restricting the number of probabilities to consider can negatively impact the accuracy of graph analysis tasks. In this paper, we propose and study an alternative approach, SOOP (second-order random walks with on-demand probability computation), that avoids the space overhead by computing the edge-to-edge transition probabilities on demand during the random walk. However, the same probabilities may be computed multiple times when the same edge appears multiple times in SOW, incurring extra cost for redundant computation and communication. We propose two optimization techniques that reduce the complexity of computing edge-to-edge transition probabilities to generate next random steps, and reduce the cost of communicating out-neighbors for the probability computation, respectively. Our experiments on real-world and synthetic graphs show that SOOP achieves orders of magnitude better performance than baseline precompute solutions, and it can efficiently computes SOW algorithms on billion-scale graphs.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Grover A, Leskovec J. Node2Vec: Scalable feature learning for networks. In Proc. the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, pp.855-864. https://doi.org/10.1145/2939-672.2939754.

  2. Wu Y, Bian Y, Zhang X. Remember where you came from: On the second-order random walk based proximity measures. Proceedings of the VLDB Endowment, 2016, 10(1): 13-24. https://doi.org/10.14778/3015270.3015272.

    Article  Google Scholar 

  3. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs//1301.3781, March 2021.

  4. Tsoumakas G, Katakis I. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 2007, 3(3): 1-13. https://doi.org/10.4018/jdwm.2007070101.

    Article  Google Scholar 

  5. Liben-Nowell D, Kleinberg J M. The link prediction problem for social networks. In Proc. the 2003 ACM CIKM International Conference on Information and Knowledge Management, November 2003, pp.556-559. https://doi.org/10.1145/956863.956972.

  6. Tang L, Liu H. Leveraging social media networks for classification. Data Min. Knowl. Discov., 2011, 23(3): 447-478. https://doi.org/10.1007/s10618-010-0210-x.

    Article  MathSciNet  MATH  Google Scholar 

  7. Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations. In Proc. the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2014, pp.701-710. https://doi.org/10.1145/2623330.2623732.

  8. Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. LINE: Large-scale information network embedding. In Proc. the 24th International Conference on World Wide Web, May 2015, pp.1067-1077. https://doi.org/10.1145/2736277.2741093.

  9. Malewicz G, Austern M H, Bik A J C, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In Proc. the 2010 ACM SIGMOD International Conference on Management of Data, June 2010, pp.135-146. https://doi.org/10.1145/1807167.1807184.

  10. Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proc. the 10th USENIX Symposium on Operating Systems Design and Implementation, October 2012, pp.17-30.

  11. Salihoglu S, Widom J. GPS: A graph processing system. In Proc. the 25th International Conference on Scientific and Statistical Database Management, July 2013, Article No. 22. https://doi.org/10.1145/2484838.2484843.

  12. Tian Y, Balmin A, Corsten S A, Tatikonda S, McPherson J. From “think like a vertex” to “think like a graph”. Proceedings of the VLDB Endowment, 2013, 7(3): 193-204. https://doi.org/10.14778/2732232.2732238.

  13. Xin R S, Gonzalez J E, Franklin M J, Stoica I. Graphx: A resilient distributed graph system on Spark. In Proc. the 1st International Workshop on Graph Data Management Experiences and Systems, June 2013, Article No. 2. https://doi.org/10.1145/2484425.2484427.

  14. Yan D, Cheng J, Lu Y, Ng W. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment, 2014, 7(14): 1981-1992. https://doi.org/10.14778/2733085.2733103.

  15. Chen R, Shi J, Chen Y, Chen H. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In Proc. the 10th European Conference on Computer Systems, April 2015, Article No. 1. https://doi.org/10.1145/2741948.2741970.

  16. Zhu X, Chen W, Zheng W, Ma X. Gemini: A computation-centric distributed graph processing system. In Proc. the 12th USENIX Symposium on Operating Systems Design and Implementation, November 2016, pp.301-316.

  17. Fan W, Xu J, Wu Y, Yu W, Jiang J, Zheng Z, Zhang B, Cao Y, Tian C. Parallelizing sequential graph computations. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.495-510. https://doi.org/10.1145/303-5918.3035942.

  18. Iosup A, Hegeman T, Ngai W L, Heldens S, Prat-Pérez A, Manhardt T, Chafi H, Capotă M, Sundaram N, Anderson M J, Tanase I G, Xia Y, Nai L, Boncz P A. LDBC graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. Proceedings of the VLDB Endowment, 2016, 9(13): 1317-1328. https://doi.org/10.14778/3007263.3007270.

  19. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin M J, Shenker S, Stoica I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. the 9th USENIX Symposium on Networked Systems Design and Implementation, April 2012, pp.15-28.

  20. Zhou D, Niu S, Chen S. Efficient graph computation for Node2Vec. arXiv:1805.00280, 2018. https://arxiv.org/abs/1805.00280, March 2021.

  21. Andersen R, Chung F R K, Lang K J. Local graph partitioning using PageRank vectors. In Proc. the 47th Annual IEEE Symposium on Foundations of Computer Science, October 2006, pp.475-486. https://doi.org/10.1109/FOCS.2006.44.

  22. Yang K, Zhang M, Chen K, Ma X, Bai Y, Jiang Y. KnightKing: A fast distributed graph random walk engine. In Proc. the 27th ACM Symposium on Operating Systems Principles, October 2019, pp.524-537. https://doi.org/10.1145/334-1301.3359634.

  23. Vose M D. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Software Eng., 1991, 17(9): 972-975. https://doi.org/10.1109/32.92917.

    Article  MathSciNet  Google Scholar 

  24. Niu S, Chen S. Optimizing CPU cache performance for Pregel-like graph computation. In Proc. the 31st IEEE International Conference on Data Engineering Workshops, April 2015, pp.149-154. https://doi.org/10.1109/ICDE-W.2015.7129568.

  25. Yang J, Leskovec J. Defining and evaluating network communities based on ground-truth. In Proc. the 12th IEEE International Conference on Data Mining, December 2012, pp.745-754. https://doi.org/10.1109/ICDM.2012.138.

  26. Boldi P, Vigna S. The WebGraph framework I: Compression techniques. In Proc. the 13th International World Wide Web Conference, May 2004, pp.595-601. https://doi.org/10.1145/988672.988752.

  27. Chakrabarti D, Zhan Y, Faloutsos C. R-MAT: A recursive model for graph mining. In Proc. the 4th SIAM International Conference on Data Mining, April 2004, pp.442-446. https://doi.org/10.1137/1.9781611972740.43.

  28. Park H, Kim M. TrillionG: A trillion-scale synthetic graph generator using a recursive vector model. In Proc. the 2017 ACM International Conference on Management of Data, May 2017, pp.913-928. https://doi.org/10.1145/3035918.3064014.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Songjie Niu.

Supplementary Information

ESM 1

(PDF 151 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Niu, S., Zhou, D. SOOP: Efficient Distributed Graph Computation Supporting Second-Order Random Walks. J. Comput. Sci. Technol. 36, 985–1001 (2021). https://doi.org/10.1007/s11390-021-1234-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-021-1234-y

Keywords

Navigation