research-article

Memory-Aware Framework for Efficient Second-Order Random Walk on Large Graphs

Authors:
Yingxia Shao, Beijing University of Posts and Telecommunications, Beijing, China
Shiyue Huang, Peking University, Beijing, China
Xupeng Miao, Peking University, Beijing, China
Bin Cui, Peking University, Beijing, China
Lei Chen, Hong Kong University of Science and Technology, Hong Kong, China

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, June 2020, Pages 1797–1812. https://doi.org/10.1145/3318464.3380562

Published: 31 May 2020

ABSTRACT

Second-order random walk is an important technique for graph analysis. Many applications use it to capture higher-order patterns in the graph, thus improving model accuracy. However, the memory explosion problem of this technique prevents it from being applied to large graphs. When processing a billion-edge graph like Twitter, existing solutions for the second-order random walk (e.g., the alias method) may take up 1796TB of memory. Such high memory overhead comes from memory-unaware strategies for node sampling across the graph. In this paper, to clearly study the efficiency of various node sampling methods in the context of second-order random walk, we design a cost model, and then propose a new node sampling method following the acceptance-rejection paradigm to achieve a better balance between memory and time cost. Further, to guarantee the efficiency of the second-order random walk within arbitrary memory budgets, we propose a memory-aware framework on the basis of the cost model. The framework applies a cost-based optimizer to assign a suitable node sampling method to each node in the graph within a memory budget while minimizing the time cost. Finally, we provide general programming interfaces so that users can benefit from the memory-aware framework easily. The empirical studies demonstrate that our memory-aware framework is robust with respect to memory and is able to achieve considerable efficiency while reducing the memory cost by 90%.
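
To make the acceptance-rejection idea concrete, the following is a minimal Python sketch of one step of a second-order walk. It is not the sampling method proposed in the paper, whose details are not given in the abstract. It assumes an unweighted, undirected graph and Node2Vec-style biases controlled by return and in-out parameters `p` and `q`; the names `second_order_step`, `second_order_walk`, `adj`, and `nbr_sets` are hypothetical and chosen only for illustration. A candidate is proposed uniformly from the current node's neighbors and accepted with probability proportional to its bias, so no per-edge probability table has to be stored.

```python
import random


def second_order_step(adj, nbr_sets, prev, cur, p=1.0, q=1.0, rng=random):
    """Sample the node after `cur`, given the walk arrived from `prev`.

    Assumed Node2Vec-style unnormalized bias for a candidate x:
      1/p if x == prev, 1 if x is adjacent to prev, 1/q otherwise.
    """
    neighbors = adj[cur]
    prev_neighbors = nbr_sets[prev]
    # Upper bound on the bias of any candidate; used as the rejection envelope.
    bound = max(1.0, 1.0 / p, 1.0 / q)
    while True:
        cand = rng.choice(neighbors)  # uniform proposal over cur's neighbors
        if cand == prev:
            bias = 1.0 / p
        elif cand in prev_neighbors:
            bias = 1.0
        else:
            bias = 1.0 / q
        # Accept with probability bias / bound, otherwise propose again.
        if rng.random() * bound < bias:
            return cand


def second_order_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """Generate a walk with `length` nodes (shorter if `start` is isolated)."""
    nbr_sets = {u: set(vs) for u, vs in adj.items()}
    walk = [start]
    if length > 1 and adj[start]:
        # The first step has no predecessor, so it is a plain uniform step.
        walk.append(rng.choice(adj[start]))
    while 1 < len(walk) < length:
        walk.append(second_order_step(adj, nbr_sets, walk[-2], walk[-1], p, q, rng))
    return walk
```

For example, `second_order_walk({0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}, start=0, length=10, p=2.0, q=0.5)` returns a list of ten node ids. The trade-off in this sketch is that each step needs only the adjacency lists, while the expected number of proposals per step grows as the biases move further from uniform.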

Supplemental Material

3318464.3380562.mp4 (mp4, 135.7 MB)

Index Terms

1. Theory of computation
  1. Design and analysis of algorithms
    1. Graph algorithms analysis
    2. Parallel algorithms

Published in
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464
General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)
Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)
Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
graph algorithm
large-scale
memory efficient
random walk
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%