ABSTRACT
The second-order random walk is an important technique for graph analysis. Many applications use it to capture higher-order patterns in a graph and thereby improve model accuracy. However, its memory explosion problem prevents it from being applied to large graphs. When processing a billion-edge graph such as Twitter, existing solutions for the second-order random walk (e.g., the alias method) may take up 1796TB of memory. This high memory overhead comes from memory-unaware strategies for node sampling across the graph. In this paper, to study the efficiency of various node sampling methods in the context of the second-order random walk, we design a cost model and then propose a new node sampling method that follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost. Further, to guarantee the efficiency of the second-order random walk within arbitrary memory budgets, we propose a memory-aware framework built on the cost model. The framework applies a cost-based optimizer that assigns a suitable node sampling method to each node in the graph so that the total memory stays within the budget while the time cost is minimized. Finally, we provide general programming interfaces so that users can benefit from the memory-aware framework easily. Empirical studies demonstrate that our memory-aware framework is robust with respect to the memory budget and achieves considerable efficiency while reducing the memory cost by 90%.
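To make the acceptance-rejection idea concrete, the following is a minimal sketch of one way a second-order walk can sample its next node without precomputed per-edge tables. It is not the sampling method proposed in the paper, whose details the abstract does not give. It assumes a node2vec-style bias with hypothetical parameters `p` and `q`, an unweighted graph stored as adjacency lists, and illustrative function names.

```python
import random


def second_order_weight(nbr_sets, prev, cand, p, q):
    """Assumed node2vec-style bias for moving to `cand` after arriving from `prev`."""
    if cand == prev:
        return 1.0 / p  # step back to the previous node
    if cand in nbr_sets[prev]:
        return 1.0      # cand is also a neighbor of prev
    return 1.0 / q      # cand moves away from prev


def rejection_sample_next(adj, nbr_sets, prev, cur, p, q, rng):
    """Draw the next node by acceptance-rejection over the neighbors of `cur`."""
    neighbors = adj[cur]
    upper = max(1.0, 1.0 / p, 1.0 / q)  # upper bound on any bias value
    while True:
        # Propose uniformly, then accept with probability weight / upper.
        cand = neighbors[rng.randrange(len(neighbors))]
        if rng.random() * upper <= second_order_weight(nbr_sets, prev, cand, p, q):
            return cand


def second_order_walk(adj, start, length, p, q, seed=0):
    """Generate a walk of up to `length` nodes starting at `start`."""
    rng = random.Random(seed)
    nbr_sets = {u: set(vs) for u, vs in adj.items()}
    walk = [start]
    if not adj[start]:
        return walk
    walk.append(rng.choice(adj[start]))  # the first step has no previous node
    while len(walk) < length:
        prev, cur = walk[-2], walk[-1]
        if not adj[cur]:
            break  # dead end: stop early
        walk.append(rejection_sample_next(adj, nbr_sets, prev, cur, p, q, rng))
    return walk


if __name__ == "__main__":
    # Small undirected example graph (hypothetical data).
    adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
    print(second_order_walk(adj, start=0, length=10, p=4.0, q=0.5, seed=1))
```

The sketch stores nothing beyond the graph itself and pays for that with repeated proposals when many candidates are rejected. A table-based method such as the alias method makes the opposite trade: constant-time draws, but one precomputed table per (previous node, current node) pair. This is the memory and time tension that the cost model in the abstract is meant to quantify.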
Index Terms
- Memory-Aware Framework for Efficient Second-Order Random Walk on Large Graphs