SOOP: Efficient Distributed Graph Computation Supporting Second-Order Random Walks

Niu, Songjie; Zhou, Dongyan

doi:10.1007/s11390-021-1234-y

Regular Paper
Published: 30 September 2021

Volume 36, pages 985–1001, (2021)
Journal of Computer Science and Technology

Songjie Niu¹,² & Dongyan Zhou³

Abstract

The second-order random walk has recently been shown to effectively improve accuracy in graph analysis tasks. Existing work mainly focuses on centralized second-order random walk (SOW) algorithms. SOW algorithms rely on edge-to-edge transition probabilities to generate the next random step. However, it is prohibitively costly to store all the probabilities for large-scale graphs, and restricting the number of probabilities to consider can negatively impact the accuracy of graph analysis tasks. In this paper, we propose and study an alternative approach, SOOP (second-order random walks with on-demand probability computation), that avoids the space overhead by computing the edge-to-edge transition probabilities on demand during the random walk. However, the same probabilities may be computed multiple times when the same edge appears multiple times in SOW, incurring extra cost for redundant computation and communication. We propose two optimization techniques: one reduces the complexity of computing the edge-to-edge transition probabilities used to generate the next random step, and the other reduces the cost of communicating out-neighbors for the probability computation. Our experiments on real-world and synthetic graphs show that SOOP achieves orders of magnitude better performance than baseline precompute solutions, and that it can efficiently compute SOW algorithms on billion-scale graphs.
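
The idea of computing edge-to-edge transition probabilities on demand can be illustrated with a minimal single-machine sketch. The abstract does not fix a particular transition model, so this example assumes the Node2Vec-style bias with a return parameter `p` and an in-out parameter `q` on an unweighted, undirected graph; the function names `transition_probs` and `second_order_walk` and the adjacency-dict representation are hypothetical and are not taken from the paper. Note that the probabilities for the step out of `cur` depend on the neighbors of `prev`, which is the information that would have to be communicated in a distributed setting.

```python
import random


def transition_probs(adj, prev, cur, p=1.0, q=1.0):
    """Compute, on demand, the probabilities of moving from edge
    (prev -> cur) to each edge (cur -> nxt).

    Assumption: Node2Vec-style bias on an unweighted graph, where
    adj maps each vertex to the set of its neighbors.
    """
    prev_nbrs = adj[prev]  # neighbors of the previous vertex
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p      # step back to the previous vertex
        elif nxt in prev_nbrs:
            weights[nxt] = 1.0          # nxt is also adjacent to prev
        else:
            weights[nxt] = 1.0 / q      # nxt moves away from prev
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def second_order_walk(adj, start, length, p=1.0, q=1.0, seed=0):
    """Generate a walk of up to `length` vertices, computing each
    transition distribution only when the walk needs it."""
    rng = random.Random(seed)
    walk = [start]
    if not adj[start]:
        return walk
    # The first step has no previous edge, so it is chosen uniformly.
    walk.append(rng.choice(sorted(adj[start])))
    while len(walk) < length:
        prev, cur = walk[-2], walk[-1]
        probs = transition_probs(adj, prev, cur, p, q)
        if not probs:
            break  # dead end: cur has no neighbors
        nbrs = sorted(probs)
        walk.append(rng.choices(nbrs, weights=[probs[v] for v in nbrs])[0])
    return walk


if __name__ == "__main__":
    graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
    print(transition_probs(graph, 0, 1, p=2.0, q=0.5))
    print(second_order_walk(graph, start=0, length=5, p=2.0, q=0.5))
```

Nothing is stored between steps in this sketch, so revisiting the same edge recomputes the same distribution; that is the redundancy the abstract says the proposed optimizations address.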

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Songjie Niu
University of Chinese Academy of Sciences, Beijing, 100049, China
Songjie Niu
Bytedance Technology, Beijing, 100086, China
Dongyan Zhou

Corresponding author

Correspondence to Songjie Niu.

Supplementary Information

ESM 1

(PDF 151 kb)

About this article

Cite this article

Niu, S., Zhou, D. SOOP: Efficient Distributed Graph Computation Supporting Second-Order Random Walks. J. Comput. Sci. Technol. 36, 985–1001 (2021). https://doi.org/10.1007/s11390-021-1234-y

Received: 25 December 2020
Accepted: 23 August 2021
Published: 30 September 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s11390-021-1234-y
