Efficient structural node similarity computation on billion-scale graphs

Chen, Xiaoshuang; Lai, Longbin; Qin, Lu; Lin, Xuemin

doi:10.1007/s00778-021-00654-9


Regular Paper
Published: 23 February 2021

Volume 30, pages 471–493, (2021)

The VLDB Journal

Xiaoshuang Chen¹,
Longbin Lai¹,² (ORCID: orcid.org/0000-0001-5443-4435),
Lu Qin³ &
…
Xuemin Lin¹,⁴

893 Accesses
3 Citations

Abstract

Structural node similarity is widely used in analyzing complex networks. Among structural node similarity metrics, role similarity has the merit of indicating automorphism (isomorphism). Existing algorithms for computing role similarity (e.g., RoleSim and NED) suffer from severe performance bottlenecks and thus cannot handle large real-world graphs. In this paper, we propose a new framework, namely StructSim, to compute nodes’ role similarity. Under this framework, we first prove that StructSim, when based on the maximum matching, is an admissible role similarity metric. Because the maximum matching is still too costly to scale, we then devise the BinCount matching, which is not only efficient to compute but also guarantees the admissibility of StructSim. BinCount-based StructSim admits a precomputed index that answers a query on a single pair of nodes in \(O(k\log D)\) time, where \(k\) is a small user-defined parameter and \(D\) is the maximum node degree. To build the index, we further devise an FM-sketch-based technique that can handle graphs with billions of edges. Extensive empirical studies show that StructSim performs much better than existing works in both effectiveness and efficiency when computing structural node similarities on real-world graphs.




Notes

It becomes isomorphism confirmation when computing similarity between nodes in two different graphs.
In the following, we use the name of a metric (e.g., RoleSim) and the algorithm that computes it interchangeably.
In [23], RoleSim has a third initialization, namely the “ALL-1” initialization, which yields the same similarity scores as the degree-ratio initialization.
https://dblp.uni-trier.de/xml/.
http://www.anac.gov.br/.
http://transtats.bts.gov/.
http://ec.europa.eu/.
We tried \(k=1\), \(k=5\) and \(k=8\) for k-NN and adopted \(k=5\) because all the baselines achieved better results with this value of \(k\).

References

Ahmed, A., Shervashidze, N., Narayanamurthy, S.M., Josifovski, V., Smola, A.J.: Distributed large-scale natural graph factorization. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 37–48 (2013)
Antonellis, I., Garcia-Molina, H., Chang, C.: Simrank++: query rewriting through link analysis of the click graph. Proc. VLDB Endow. 1(1), 408–421 (2008)
Avis, D.: A survey of heuristics for the weighted matching problem. Networks 13(4), 475–493 (1983)
Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems, NIPS, pp. 585–591 (2001)
BlogCatalog. https://github.com/quark0/TAE/tree/master/data/BlogCatalog-dataset
Boldi, P., Vigna, S.: The webgraph framework I: compression techniques. In: Proceedings of the 13th International Conference on World Wide Web, pp. 595–602 (2004)
Cao, S., Lu, W., Xu, Q.: Grarep: Learning graph representations with global structural information. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 891–900 (2015)
Cao, S., Lu, W., Xu, Q.: Deep neural networks for learning graph representations. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1145–1152 (2016)
Chamberlain, B.P., Clough, J.R., Deisenroth, M.P.: Neural embeddings of graphs in hyperbolic space. CoRR, abs/1705.10359 (2017)
Chen, X., Lai, L., Qin, L., Lin, X.: Structsim: querying structural node similarity at billion scale. In: 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20–24, 2020, pp. 1950–1953 (2020)
Conte, A., Ferraro, G., Grossi, R., Marino, A., Sadakane, K., Uno, T.: Node similarity with q-grams for real-world labeled networks. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1282–1291 (2018)
Davis, D., Yaveroğlu, Ö.N., Malod-Dognin, N., Stojmirovic, A., Pržulj, N.: Topology-function conservation in protein–protein interaction networks. Bioinformatics 31(10), 1632–1639 (2015)
Distinguishability, C.: A theoretical analysis of normalized discounted cumulative gain (ndcg) ranking measures
Donnat, C., Zitnik, M., Hallac, D., Leskovec, J.: Learning structural node embeddings via diffusion wavelets. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1320–1329 (2018)
Flajolet, P., Martin, G.N.: Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2), 182–209 (1985)
Fogaras, D., Rácz, B.: Scaling link-based similarity search. In: Proceedings of the 14th International Conference on World Wide Web, pp. 641–650 (2005)
Fujiwara, Y., Nakatsuji, M., Shiokawa, H., Onizuka, M.: Efficient search algorithm for simrank. In: 29th IEEE International Conference on Data Engineering, pp. 589–600 (2013)
Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016)
Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40(3), 52–74 (2017)
Henderson, K., Gallagher, B., Eliassi-Rad, T., Tong, H., Basu, S., Akoglu, L., Koutra, D., Faloutsos, C., Li, L.: Rolx: structural role extraction & mining in large graphs. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1231–1239 (2012)
Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H., Faloutsos, C.: It’s who you know: graph mining using recursive structural features. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 663–671 (2011)
Jeh, G., Widom, J.: Simrank: a measure of structural-context similarity. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 538–543 (2002)
Jin, R., Lee, V.E., Hong, H.: Axiomatic ranking of network role similarity. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 922–930 (2011)
Jin, R., Lee, V.E., Li, L.: Scalable and axiomatic ranking of network role similarity. ACM Trans. Knowl. Discov. Data 8(1), 3:1–3:37 (2014)
Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
Kuhn, H.W.: The Hungarian method for the assignment problem. In: 50 Years of Integer Programming 1958–2008, pp. 29–47 (2010)
Kusumoto, M., Maehara, T., Kawarabayashi, K.: Scalable similarity search for simrank. In: Proceedings of the 2014 International Conference on Management of Data, pp. 325–336 (2014)
Leicht, E.A., Holme, P., Newman, M.E.: Vertex similarity in networks. Phys. Rev. E 73(2), 026120 (2006)
Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data (2014)
Li, C., Han, J., He, G., Jin, X., Sun, Y., Yu, Y., Wu, T.: Fast computation of simrank for static and dynamic information networks. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 465–476 (2010)
Lin, X., Yuan, Y., Zhang, Q., Zhang, Y.: Selecting stars: the k most representative skyline operator. In: Proceedings of the 23rd International Conference on Data Engineering, pp. 86–95 (2007)
Lin, Z., Lyu, M.R., King, I.: Matchsim: a novel neighbor-based similarity measure with maximum neighborhood matching. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1613–1616 (2009)
Liu, D., Huang, J., Lin, C.: Recommendation with social roles. IEEE Access 6, 36420–36427 (2018)
Liu, Y., Zheng, B., He, X., Wei, Z., Xiao, X., Zheng, K., Lu, J.: Probesim: scalable single-source and top-k simrank computations on dynamic graphs. Proc. VLDB Endow. 11(1), 14–26 (2017)
Lorrain, F., White, H.C.: Structural equivalence of individuals in social networks. J. Math. Sociol. 1(1), 49–80 (1971)
Lyu, T., Zhang, Y., Zhang, Y.: Enhancing the network embedding quality with structural similarity. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 147–156 (2017)
Optimization and approximation in deterministic sequencing and scheduling: a survey. Volume 5 of Annals of Discrete Mathematics, pp. 287–326 (1979)
Ou, M., Cui, P., Pei, J., Zhang, Z., Zhu, W.: Asymmetric transitivity preserving graph embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1105–1114 (2016)
Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014)
Perozzi, B., Kulkarni, V., Skiena, S.: Walklets: multiscale graph embeddings for interpretable network classification. CoRR, abs/1605.02115 (2016)
Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: learning node representations from structural identity. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394 (2017)
Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420 (2007)
Rossi, R.A., Gallagher, B., Neville, J., Henderson, K.: Modeling dynamic behavior in large evolving graphs. In: Sixth ACM International Conference on Web Search and Data Mining, pp. 667–676 (2013)
Serrano, M.A., Boguná, M.: Topology of the world trade web. Phys. Rev. E 68(1), 015101 (2003)
Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002)
Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)
Tian, B., Xiao, X.: SLING: a near-optimal index structure for simrank. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1859–1874 (2016)
Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234 (2016)
Wang, X., Tang, L., Gao, H., Liu, H.: Discovering overlapping groups in social media. In: 2010 IEEE International Conference on Data Mining. IEEE, pp. 569–578 (2010)
Wang, Y., Lian, X., Chen, L.: Efficient simrank tracking in dynamic graphs. In: 2018 IEEE 34th International Conference on Data Engineering, pp. 545–556 (2018)
Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications, vol. 8. Cambridge University Press, Cambridge (1994)
Yu, W., Lin, X., Zhang, W.: Towards efficient simrank computation on large networks. In: 29th IEEE International Conference on Data Engineering, pp. 601–612 (2013)
Yu, W., Lin, X., Zhang, W., Chang, L., Pei, J.: More is simpler: effectively and efficiently assessing node-pair similarities based on hyperlinks. Proc. VLDB Endow. 7(1), 13–24 (2013)
Yu, W., Lin, X., Zhang, W., Pei, J., McCann, J.A.: Simrank*: effective and scalable pairwise similarity search based on graph topology. VLDB J. 28(3), 401–426 (2019)
Yu, W., McCann, J.A.: Efficient partial-pairs simrank search for large networks. Proc. VLDB Endow. 8(5), 569–580 (2015)
Yu, W., McCann, J.A.: High quality graph-based similarity search. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 83–92 (2015)
Zhang, K., Statman, R., Shasha, D.: On the editing distance between unordered labeled trees. Inf. Process. Lett. 42(3), 133–139 (1992)
Zhao, P., Han, J., Sun, Y.: P-rank: a comprehensive structural similarity measure over information networks. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 553–562 (2009)
Zheng, W., Zou, L., Feng, Y., Chen, L., Zhao, D.: Efficient simrank-based similarity join over large graphs. Proc. VLDB Endow. 6(7), 493–504 (2013)
Zhu, H., Meng, X., Kollios, G.: NED: an inter-graph node metric based on edit distance. Proc. VLDB Endow. 10(6), 697–708 (2017)


Acknowledgements

Xuemin Lin is supported by NSFC61232006, 2018YFB1003504, ARC DP200101338, ARC DP180103096 and ARC DP170101628. Lu Qin is supported by ARC FT200100787.

Author information

Authors and Affiliations

University of New South Wales, Sydney, Australia
Xiaoshuang Chen, Longbin Lai & Xuemin Lin
Alibaba Group, Hangzhou, China
Longbin Lai
Centre for AI, University of Technology Sydney, Sydney, Australia
Lu Qin
East China Normal University, Shanghai, China
Xuemin Lin


Corresponding author

Correspondence to Longbin Lai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Chen, X., Lai, L., Qin, L. et al. Efficient structural node similarity computation on billion-scale graphs. The VLDB Journal 30, 471–493 (2021). https://doi.org/10.1007/s00778-021-00654-9


Received: 23 February 2020
Revised: 07 September 2020
Accepted: 14 January 2021
Published: 23 February 2021
Issue Date: May 2021
DOI: https://doi.org/10.1007/s00778-021-00654-9
