DOI: 10.1145/3318464.3380562
research-article

Memory-Aware Framework for Efficient Second-Order Random Walk on Large Graphs

Published: 31 May 2020

ABSTRACT

Second-order random walk is an important technique for graph analysis. Many applications use it to capture higher-order patterns in the graph and thereby improve model accuracy. However, its memory explosion problem prevents it from being applied to large graphs. When processing a billion-edge graph such as Twitter, existing solutions for the second-order random walk (e.g., the alias method) may take up 1796 TB of memory. This high memory overhead comes from memory-unaware strategies for node sampling across the graph. In this paper, to study the efficiency of various node sampling methods in the context of second-order random walk, we design a cost model and then propose a new node sampling method that follows the acceptance-rejection paradigm and achieves a better balance between memory and time cost. Further, to guarantee the efficiency of the second-order random walk within an arbitrary memory budget, we propose a memory-aware framework built on the cost model. The framework applies a cost-based optimizer that assigns a suitable node sampling method to each node in the graph, staying within the memory budget while minimizing the time cost. Finally, we provide general programming interfaces so that users can easily benefit from the memory-aware framework. Empirical studies demonstrate that the memory-aware framework is robust with respect to memory and achieves considerable efficiency while reducing the memory cost by 90%.
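
To make the acceptance-rejection idea concrete, here is a minimal Python sketch. The abstract does not describe the proposed sampler in detail, so this is not the authors' method. It is a generic rejection sampler for a node2vec-style second-order walk, and the bias parameters `p` and `q`, the helper names, and the graph representation are all assumptions made for illustration. The point it illustrates is the memory/time trade-off: an alias table would have to be stored for every (previous, current) node pair, whereas rejection sampling keeps only the adjacency lists and pays with a variable number of retries per step.

```python
import random


def build_graph(edges):
    """Build an undirected, unweighted graph from an edge list.

    Each node maps to (tuple of neighbours, set of neighbours): the tuple
    gives O(1) uniform proposals, the set gives O(1) membership tests.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {u: (tuple(sorted(nbrs)), nbrs) for u, nbrs in adj.items()}


def second_order_step(graph, prev, cur, p=1.0, q=1.0, rng=random):
    """Sample the node after `cur`, given that the walk came from `prev`.

    Assumed node2vec-style unnormalised weights:
      1/p if the candidate is `prev`, 1 if it is also a neighbour of `prev`,
      1/q otherwise. No per-(prev, cur) table is stored.
    """
    neighbours, _ = graph[cur]
    _, prev_neighbours = graph[prev]
    upper = max(1.0, 1.0 / p, 1.0 / q)  # upper bound on any candidate weight
    while True:
        cand = rng.choice(neighbours)   # proposal: uniform over neighbours of cur
        if cand == prev:
            weight = 1.0 / p
        elif cand in prev_neighbours:
            weight = 1.0
        else:
            weight = 1.0 / q
        if rng.random() * upper < weight:  # accept with probability weight / upper
            return cand


def second_order_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """Generate a walk of `length` nodes; the first step is uniform."""
    walk = [start]
    if length > 1:
        walk.append(rng.choice(graph[start][0]))
    while len(walk) < length:
        walk.append(second_order_step(graph, walk[-2], walk[-1], p, q, rng))
    return walk


if __name__ == "__main__":
    g = build_graph([(0, 1), (1, 2), (2, 0), (2, 3)])
    print(second_order_walk(g, start=0, length=8, p=0.5, q=2.0, rng=random.Random(42)))
```

The expected number of proposals per step depends on how far the candidate weights fall below `upper`, which is the kind of time cost a cost model could weigh against the memory needed for precomputed tables.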

Supplemental Material

3318464.3380562.mp4 (mp4, 135.7 MB)

      • Published in

        SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
        June 2020
        2925 pages
        ISBN: 9781450367356
        DOI: 10.1145/3318464

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
