Fast and Flexible Top-k Similarity Search on Large Networks

Published: 21 August 2017

Abstract

Similarity search is a fundamental problem in network analysis with many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method that uses random paths to efficiently estimate similarities based on both common neighbors and structural contexts in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the required sample size depends on the error bound ε, the confidence level (1 − δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges and show that our algorithm returns the top-k similar vertices for any vertex in a network 300× faster than state-of-the-art methods. We also develop a prototype system that recommends similar authors to demonstrate the effectiveness of our method.
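
The random-path idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an unweighted graph stored as adjacency lists, a uniformly chosen start vertex, uniform neighbor selection, and a similarity score defined as the fraction of sampled paths that contain both vertices. The function names (sample_random_paths, estimate_similarity, top_k_similar) and the parameter values are hypothetical, and the sketch covers only the path co-occurrence estimate, not the structural-context variant.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_random_paths(adj, num_paths, path_len, rng):
    """Sample num_paths random walks of at most path_len steps each.

    Assumption: start vertices and next hops are chosen uniformly at random.
    """
    vertices = sorted(adj)
    paths = []
    for _ in range(num_paths):
        v = rng.choice(vertices)
        path = [v]
        for _ in range(path_len):
            neighbors = adj[v]
            if not neighbors:  # dead end: stop this walk early
                break
            v = rng.choice(neighbors)
            path.append(v)
        paths.append(path)
    return paths


def estimate_similarity(paths):
    """Score each vertex pair by the fraction of paths containing both."""
    counts = defaultdict(int)
    for path in paths:
        # Count each unordered pair at most once per path.
        for u, v in combinations(sorted(set(path)), 2):
            counts[(u, v)] += 1
    total = len(paths)
    return {pair: c / total for pair, c in counts.items()}


def top_k_similar(scores, query, k):
    """Return up to k (vertex, score) pairs with the highest score for query."""
    related = []
    for (u, v), s in scores.items():
        if u == query:
            related.append((v, s))
        elif v == query:
            related.append((u, s))
    # Sort by descending score, breaking ties by vertex id.
    related.sort(key=lambda item: (-item[1], item[0]))
    return related[:k]


# Toy graph: two triangles joined by the edge 2-3 (illustrative data only).
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
paths = sample_random_paths(adj, num_paths=1000, path_len=4, rng=random.Random(0))
scores = estimate_similarity(paths)
print(top_k_similar(scores, query=0, k=3))
```

In this sketch the cost of answering a query is governed by the number of sampled paths and the path length rather than by the size of the graph, which is the kind of trade-off the abstract's guarantee on sample size refers to.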

    Published In

    ACM Transactions on Information Systems, Volume 36, Issue 2
    April 2018
    371 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3133943

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 August 2017
    Accepted: 01 March 2017
    Revised: 01 March 2017
    Received: 01 August 2016
    Published in TOIS Volume 36, Issue 2

    Author Tags

    1. Vertex similarity
    2. heterogeneous information network
    3. random path
    4. similarity search
    5. social network

    Qualifiers

    • Research-article
    • Research
    • Refereed
