Abstract
Second-order random walk is an important technique for graph analysis. Many applications, including graph embedding, proximity measures and community detection, use it to capture higher-order patterns in the graph, thus improving model accuracy. However, the memory explosion problem of this technique prevents it from analyzing large graphs. When processing a billion-edge graph like Twitter, existing solutions (e.g., the alias method) for the second-order random walk may take up 1796TB of memory. Such high memory consumption comes from memory-unaware strategies for node sampling during the random walk. In this paper, to clearly compare the efficiency of various node sampling methods, we first design a cost model and propose two new node sampling methods: one follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost, and the other is optimized for fast sampling of the skewed probability distributions that exist in natural graphs. Second, to achieve high efficiency of the second-order random walk within arbitrary memory budgets, we propose a novel memory-aware framework based on the cost model. The framework applies a cost-based optimizer to assign a suitable node sampling method to each node or edge in the graph within a memory budget while minimizing the time cost of the random walk. Finally, the framework provides general programming interfaces that let users define new second-order random walk models easily. The empirical studies demonstrate that our memory-aware framework is robust with respect to memory and is able to achieve considerable efficiency while reducing the memory cost by 90%.












Notes
Note that the coefficient c is incurred by finding the edge id between the previous node u and the current node v in order to access the group information.
Note that the minimal memory of the rejection method differs from the one in our conference version, because we store the number of common neighbors of each edge in memory to quickly compute the exact bounding constant, thus improving the efficiency of the rejection method over billion-edge graphs.
References
Boldi, P., Rosa, M.: Arc-community detection via triangular random walks. In: 2012 Eighth Latin American Web Congress, pp. 48–56 (2012)
Bonner, S., Kureshi, I., Brennan, J., Theodoropoulos, G., McGough, A.S., Obara, B.: Exploring the semantic content of unsupervised graph embeddings: an empirical study. Data Sci. Eng. 4(3), 269–289 (2019)
Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43 (1998)
Das Sarma, A., Molla, A.R., Pandurangan, G.: Efficient random walk sampling in distributed networks. J. Parallel Distrib. Comput. 77, 84–94 (2015)
Dave, V.S., Zhang, B., Chen, P.Y., Hasan, M.A.: Neural-brane: neural Bayesian personalized ranking for attributed network embedding. Data Sci. Eng. 4(2), 119–131 (2019)
Dudzinski, K., Walukiewicz, S.: Exact methods for the knapsack problem and its generalizations. Eur. J. Op. Res. 28(1), 3–21 (1987)
Feng, S., Cong, G., Khan, A., Li, X., Liu, Y., Chee, Y.M.: Inf2vec: Latent representation model for social influence embedding. In: ICDE, pp. 941–952 (2018)
Grimmett, G., Stirzaker, D.: Probability and Random Processes, vol. 80. Oxford University Press, Oxford (2001)
Grover, A., Leskovec, J.: Node2vec: Scalable feature learning for networks. In: KDD, pp. 855–864 (2016)
Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS, pp. 1025–1035 (2017)
He, H., Singh, A.K.: Graphs-at-a-time: Query language and access methods for graph databases. In: SIGMOD, pp. 405–418 (2008)
Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow. 4(11), 1111–1122 (2011)
Hu, X., Tao, Y., Chung, C.W.: Massive graph triangulation. In: SIGMOD, pp. 325–336 (2013)
Huang, J., Venkatraman, K., Abadi, D.J.: Query optimization of distributed pattern matching. In: ICDE, pp. 64–75 (2014)
Kyrola, A.: Drunkardmob: Billions of random walks on just a pc. In: RecSys, pp. 257–264 (2013)
Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond: The Science of Search Engine Rankings, Chapter The Mathematics Guide. Princeton University Press, Princeton (2011)
Latapy, M.: Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. 407(1–3), 458–473 (2008)
Li, R.H., Yu, J.X., Qin, L., Mao, R., Jin, T.: On random walk based graph sampling. In: ICDE, pp. 927–938 (2015)
Li, X., Zhuang, Y., Fu, Y., He, X.: A trust-aware random walk model for return propensity estimation and consumer anomaly scoring in online shopping. Sci. China Inf. Sci. 62(5), 52101 (2019)
Liben-Nowell, D., Kleinberg, J.: The link prediction problem for social networks. In: CIKM, pp. 556–559 (2003)
Lim, S., Ryu, S., Kwon, S., Jung, K., Lee, J.G.: Linkscan*: Overlapping community detection using the link-space transformation. In: ICDE, pp. 292–303 (2014)
Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. Proc. VLDB Endow. 9(12), 1005–1016 (2016)
Lombardo, G., Poggi, A.: A scalable and distributed actor-based version of the node2vec algorithm. In: WOA (2019)
Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010)
Marsaglia, G.: Generating discrete random variables in a computer. Commun. ACM 6(1), 37–38 (1963)
Martin, R., et al.: Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5, 4630 (2014)
Nazi, A., Zhou, Z., Thirumuruganathan, S., Zhang, N., Das, G.: Walk, not wait: faster sampling over online social networks. Proc. VLDB Endow. 8(6), 678–689 (2015)
Peng, H., Li, J., Yan, H., Gong, Q., Wang, S., Liu, L., Wang, L., Ren, X.: Dynamic network embedding via incremental skip-gram with negative sampling. Sci. China Inf. Sci. 63(10), 1–19 (2020)
Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: KDD, pp. 701–710 (2014)
Pisinger, D.: A minimal algorithm for the multiple-choice knapsack problem. Eur. J. Op. Res. 83(2), 394–410 (1995)
Raftery, A.E.: A model for high-order markov chains. J. R. Stat. Soc. Ser. B 47(3), 528–539 (1985)
Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Publishing Company, New York (2010)
Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2003)
Salnikov, V., Schaub, M.T., Lambiotte, R.: Using higher-order markov models to reveal flow-based communities in networks. Sci. Rep. 5(23194), 1–13 (2016)
Sengupta, N., Bagchi, A., Ramanath, M., Bedathur, S.: Arrow: Approximating reachability using random walks over web-scale graphs. In: ICDE, pp. 470–481 (2019)
Shao, Y., Cui, B., Chen, L., Liu, M., Xie, X.: An efficient similarity search framework for simrank over large dynamic graphs. Proc. VLDB Endow. 8(8), 838–849 (2015)
Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., Xu, N.: Parallel subgraph listing in a large-scale graph. In: SIGMOD, pp. 625–636 (2014)
Shao, Y., Huang, S., Miao, X., Cui, B., Chen, L.: Memory-aware framework for efficient second-order random walk on large graphs. In: SIGMOD, pp. 1797–1812 (2020)
Sinha, P., Zoltners, A.A.: The multiple-choice knapsack problem. Op. Res. 27(3), 503–515 (1979)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2(2), 1626–1629 (2009)
Tsitsulin, A., Mottin, D., Karras, P., Müller, E.: Verse: Versatile graph embeddings from similarity measures. In: WWW, pp. 539–548 (2018)
Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehous. Min. 2007, 1–13 (2007)
Walker, A.J.: An efficient method for generating discrete random variables with general distributions. ACM Trans. Math. Softw. 3(3), 253–256 (1977)
Wang, R., Li, Y., Xie, H., Xu, Y., Lui, J.C.S.: Graphwalker: An i/o-efficient and resource-friendly graph analytic system for fast and scalable random walks. In: ATC, pp. 559–571 (2020)
Wu, Y., Bian, Y., Zhang, X.: Remember where you came from: on the second-order random walk based proximity measures. Proc. VLDB Endow. 10(1), 13–24 (2016)
Xu, J., Wickramarathne, T., Chawla, N.V.: Representing higher-order dependencies in networks. Sci. Adv. (2016)
Yang, K., Zhang, M., Chen, K., Ma, X., Bai, Y., Jiang, Y.: Knightking: a fast distributed graph random walk engine. In: SOSP, pp. 524–537 (2019)
Zemel, E.: The linear multiple choice knapsack problem. Op. Res. 28(6), 1412–1423 (1980)
Zhao, P., Han, J.: On graph query optimization in large networks. Proc. VLDB Endow. 3(1–2), 340–351 (2010)
Zhou, D., Niu, S., Chen, S.: Efficient graph computation for node2vec. CoRR abs/1805.00280 (2018)
Acknowledgements
This work is supported by the National Key Research and Development Program of China (No. 2018YFB1402600), NSFC (Nos. U1936104, 61902037, 61832001), CAAI-Huawei MindSpore Open Fund, Beijing Academy of Artificial Intelligence (BAAI), PKU-Baidu Fund 2019BD006, and the Fundamental Research Funds for the Central Universities 2020RC25. Lei Chen’s work is partially supported by National Key Research and Development Program of China Grant No. 2018AAA0101100, the Hong Kong RGC GRF Project 16202218, CRF Project C6030-18G, C1031-18G, C5026-18G, AOE Project AoE/E-603/18, Theme-based project TRS T41-603/20R, China NSFC No. 61729201, Guangdong Basic and Applied Basic Research Foundation 2019B151530001, Hong Kong ITC ITF grants ITS/044/18FX and ITS/470/18FX, Microsoft Research Asia Collaborative Research Grant, Didi-HKUST joint research lab project, and Wechat and Webank Research Grants.
Appendices
The proof of proposition 1
Proof
Let G(V, E) be an unweighted graph, and let u and v be the previous node and the current node, respectively. Since the nodes in the same group have the same probability, we simply use \(p_i, i=1..3\), to denote the e2e probability of a node in the ith group, and \(p_i^g, i=1..3\), to denote the probability of the ith group.
According to the definition of groups, we have \(|G_{I}|=1, |G_{II}|=\theta _{uv}, |G_{III}|=d_v-1-\theta _{uv}\). Therefore, \(p_{i}^g=\sum _{j=1}^{|G_i|}p_i=|G_i|\,p_i\).
Based on the analysis in Section 5.1, the time cost of the rejection node sampler \(T_r\) is \(C_vcK\), and the time cost of the group-based node sampler \(T_g\) is \((1+p_1^gc+p_2^gC_{v}^{g_2}c+ p_3^gC_{v}^{g_3}c)K\).
For an unweighted graph, we have \(C_v = d_v \max \{p_1, p_2, p_3\}\), \(C_v^{g_2}=1\), \(C_v^{g_3}=\frac{d_v}{d_v-\theta _{uv}-1}\).
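To derive the condition for \(T_r > T_g\), we substitute \(p_1^g=p_1\), \(p_2^g=\theta _{uv}p_2\) and \(p_3^g=(d_v-1-\theta _{uv})p_3\) into the two time costs and cancel K, so we should have
\[ c\,d_v\max \{p_1, p_2, p_3\} > 1 + c\left( p_1+\theta _{uv}p_2+d_vp_3\right) . \]
Therefore, the above inequality holds when the condition stated in Proposition 1 is satisfied, and the proposition is proved. \(\square \)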
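The comparison in this proof can be checked numerically. The following is a minimal sketch, assuming the cost expressions quoted above for an unweighted graph; the function names and the unnormalized group weights `w1`, `w2`, `w3` are hypothetical and introduced only for illustration.

```python
def group_probabilities(d_v, theta_uv, w1, w2, w3):
    """Return group sizes and per-node probabilities p_1..p_3.

    w1, w2, w3 are hypothetical unnormalized weights of a single node in
    groups I, II and III; they are normalized so that all d_v neighbors
    of v sum to probability 1.
    """
    sizes = (1, theta_uv, d_v - 1 - theta_uv)
    z = sum(s * w for s, w in zip(sizes, (w1, w2, w3)))
    p = tuple(w / z for w in (w1, w2, w3))
    return sizes, p


def sampler_costs(d_v, theta_uv, w1, w2, w3, c=1.0, K=1.0):
    """Return (T_r, T_g) using the cost expressions from the proof."""
    sizes, p = group_probabilities(d_v, theta_uv, w1, w2, w3)
    # Group probabilities: p_i^g = |G_i| * p_i.
    pg = tuple(s * pi for s, pi in zip(sizes, p))
    # Bounding constants for the unweighted case.
    C_v = d_v * max(p)
    C_g2 = 1.0
    # If group III is empty its term vanishes, so avoid dividing by zero.
    C_g3 = d_v / sizes[2] if sizes[2] > 0 else 0.0
    T_r = C_v * c * K
    T_g = (1 + pg[0] * c + pg[1] * C_g2 * c + pg[2] * C_g3 * c) * K
    return T_r, T_g


if __name__ == "__main__":
    # Illustrative values only: d_v = 10, theta_uv = 3.
    for c in (1.0, 4.0):
        T_r, T_g = sampler_costs(10, 3, 0.25, 1.0, 0.25, c=c)
        print(f"c={c}: T_r={T_r:.3f}, T_g={T_g:.3f}, rejection slower: {T_r > T_g}")
```

With these illustrative inputs the outcome of the comparison depends on the coefficient c, which is why the proposition needs a condition rather than an unconditional claim.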
LP-domination analysis
In this section, we show that there is no LP-domination among the alias, rejection and naive sampling methods. Here, we give the proof with a common setting \(b_f=4\), \(b_i=4\), \(c=1\).
Proof
Following the cost model in Table 2, to prove that there is no LP-domination among the three sampling methods, we need to show that \(\frac{T_{r}-T_{n}}{M_{r}-M_{n}}-\frac{T_{a}-T_{r}}{M_{a}-M_{r}}\le 0\) holds.
Given \(0<M_n=\frac{b_fd_{max}}{|V|}<b_f=4\) and \(C_v\le d_v\), it is easy to see that \((12d_v-M_n)(8d_v^2-4d_v) > 0\) when \(d_v \ge 1\). Then, we only need to compute the bound of \((C_v-2d_v)(8d_v^2-4d_v)-(1-C_v)(12d_v-M_n)\) as below:
\(\square \)
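The inequality used in this proof compares two slopes in the memory-time plane. The sketch below evaluates that inequality for three (memory, time) points; the function names and the sample points are hypothetical and are not taken from Table 2.

```python
def slope(a, b):
    """Slope of time against memory between two (memory, time) points."""
    (m1, t1), (m2, t2) = a, b
    return (t2 - t1) / (m2 - m1)


def no_lp_domination(naive, rejection, alias):
    """Check (T_r - T_n)/(M_r - M_n) - (T_a - T_r)/(M_a - M_r) <= 0.

    Each argument is a (memory, time) pair, with memory strictly
    increasing from naive to rejection to alias.
    """
    return slope(naive, rejection) - slope(rejection, alias) <= 0


if __name__ == "__main__":
    # Hypothetical costs: time drops quickly at first, then flattens.
    print(no_lp_domination((1, 10), (2, 4), (4, 2)))
    # Hypothetical costs: the middle method buys little for its memory.
    print(no_lp_domination((1, 10), (2, 9), (4, 2)))
```

When the check fails, the middle point lies above the line joining the other two, so a mix of the two outer methods would offer a better trade-off in the relaxed problem.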
Analysis of the results of Deg-inc on Youtube
In Fig. 8a, b, when the memory budget is larger than 7.5 GB, Deg-inc performs similarly to LP-std and LP-est on Youtube. To analyze the reasons behind these results, we take Fig. 8a as an example and profile the distribution of node sampler types. We also list the concrete node samplers for the nodes with the top-10 largest degrees. The statistics are reported in Table 10. From the table, we clearly see that when the memory budget is 7.5 GB, the distributions of node sampler types are almost the same for LP-std and Deg-inc. After checking the complete node sampler assignment, we find that only two nodes have different node samplers. Recall that Deg-inc processes the nodes with small degrees first. Due to the sparsity of Youtube, even when all the nodes with small degrees are assigned the alias method, enough memory budget is left to allow nodes with large degrees to use the rejection method. But when the memory budget is 2.5 GB, nodes with large degrees are assigned the naive node sampler by Deg-inc, resulting in poor efficiency. Unlike Deg-inc, Deg-dec is able to assign the alias method or the rejection method to nodes with large degrees whether the memory budget is 2.5 GB or 7.5 GB. However, Deg-dec always processes the largest nodes first, thus consuming a lot of the memory budget. As a result, Deg-dec leaves many other nodes using the naive method, and the average degree of the naive method for Deg-dec in Table 10 implicitly demonstrates such a node sampler assignment.
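The behavior described above can be illustrated with a toy greedy assignment. This is only a sketch under loud assumptions: the memory costs (quadratic in the degree for alias, linear for rejection, zero extra for naive) are placeholders rather than the values of the cost model, the rule "take the fastest sampler that still fits" is assumed, and `greedy_assign` is a hypothetical name.

```python
def greedy_assign(degrees, budget, ascending=True):
    """Assign a node sampler to each node in degree order.

    ascending=True mimics a Deg-inc style order (small degrees first),
    ascending=False a Deg-dec style order (large degrees first).
    Memory costs below are toy placeholders, not the paper's cost model.
    """
    def memory(method, d):
        return {"alias": d * d, "rejection": d, "naive": 0}[method]

    order = sorted(range(len(degrees)), key=lambda v: degrees[v],
                   reverse=not ascending)
    remaining = budget
    assignment = {}
    for v in order:
        # Try samplers from fastest to slowest and keep the first that fits.
        for method in ("alias", "rejection", "naive"):
            cost = memory(method, degrees[v])
            if cost <= remaining:
                assignment[v] = method
                remaining -= cost
                break
    return assignment


if __name__ == "__main__":
    degrees = [1, 1, 2, 10]  # a sparse toy graph with one high-degree node
    print(greedy_assign(degrees, budget=20, ascending=True))
    print(greedy_assign(degrees, budget=5, ascending=True))
    print(greedy_assign(degrees, budget=12, ascending=False))
```

In this toy setting, a generous budget lets the small-degree-first order give the high-degree node the rejection method, a tight budget pushes that node to the naive method, and the large-degree-first order spends the budget early and leaves low-degree nodes with the naive method.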
Cite this article
Shao, Y., Huang, S., Li, Y. et al. Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs. The VLDB Journal 30, 769–797 (2021). https://doi.org/10.1007/s00778-021-00669-2
DOI: https://doi.org/10.1007/s00778-021-00669-2