research-article

Fast and Flexible Top-k Similarity Search on Large Networks

Authors:

Marie-Francine Moens

ACM Transactions on Information Systems (TOIS), Volume 36, Issue 2

Article No.: 13, Pages 1 - 30

https://doi.org/10.1145/3086695

Published: 21 August 2017

Abstract

Similarity search is a fundamental problem in network analysis with many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method that uses random paths to efficiently estimate similarities based on both common neighbors and structural contexts in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sample size depends on the error bound ε, the confidence level (1-δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return the top-k similar vertices for any vertex in a network 300× faster than state-of-the-art methods. We also develop a prototype system that recommends similar authors to demonstrate the effectiveness of our method.
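The random-path idea in the abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names (`sample_size`, `random_path_similarity`, `top_k`), the constant `c = 0.5`, and the exact form of the sample-size bound are assumptions made for illustration. The sketch samples R random paths of length T on an unweighted graph and scores a pair of vertices by the fraction of sampled paths that contain both.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def sample_size(eps, delta, T, c=0.5):
    """Number of random paths R for error bound eps and confidence 1 - delta.

    ASSUMPTION: a VC-style bound of the form
    (c / eps^2) * (log2(C(T, 2)) + 1 + ln(1 / delta)); both the form and
    c = 0.5 are illustrative, not taken from the text above. Requires T >= 2.
    """
    return math.ceil(c / eps ** 2 * (math.log2(math.comb(T, 2)) + 1 + math.log(1 / delta)))


def random_path_similarity(adj, R, T, seed=0):
    """Estimate pairwise similarity as the fraction of R sampled paths
    (each with at most T steps) on which both vertices appear.

    adj maps each vertex to a list of its neighbors (unweighted graph).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    sim = defaultdict(float)
    for _ in range(R):
        path = [rng.choice(nodes)]           # uniform random start vertex
        for _ in range(T):
            nbrs = adj[path[-1]]
            if not nbrs:                     # dead end: stop this path early
                break
            path.append(rng.choice(nbrs))    # uniform random neighbor
        # Every unordered pair of distinct vertices on the path gets credit.
        for a, b in combinations(sorted(set(path)), 2):
            sim[(a, b)] += 1.0 / R
    return sim


def top_k(sim, v, k):
    """Return the k vertices with the highest estimated similarity to v."""
    scores = {}
    for (a, b), s in sim.items():
        if a == v:
            scores[b] = s
        elif b == v:
            scores[a] = s
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy graph: two disconnected triangles.
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
    R = sample_size(eps=0.1, delta=0.1, T=5)
    sim = random_path_similarity(adj, R=R, T=5)
    print(top_k(sim, 0, 2))
```

Under the assumed bound, the number of sampled paths is a function of ε, δ, and T alone, so the sampling cost in this sketch does not grow with the number of vertices or edges.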

Supplementary Material

JPG File (a13-zhang.jpg), 20.08 KB

MP4 File (a13-zhang.mp4), 444.98 MB



Index Terms

Fast and Flexible Top-k Similarity Search on Large Networks
1. Information systems
  1. Information systems applications
    1. Data mining


Published In


ACM Transactions on Information Systems Volume 36, Issue 2

April 2018

371 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/3133943

Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands


Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2017

Accepted: 01 March 2017

Revised: 01 March 2017

Received: 01 August 2016

Published in TOIS Volume 36, Issue 2


Qualifiers

Research-article
Research
Refereed

Article Metrics

Total Citations: 23
Total Downloads: 560
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 0

Reflects downloads up to 20 Feb 2025

Cited By

Khalilzadeh F, Cicekli I. 2024. REHREC: Review Effected Heterogeneous Information Network Recommendation System. IEEE Access 12, 42751-42760. Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3379271

Wang C, Zhang J, Liu X, Wei K, Feng B. 2024. Verifiable Graph-Based Approximate Nearest Neighbor Search. Advanced Data Mining and Applications, 3-17. Online publication date: 3-Dec-2024.
https://dl.acm.org/doi/10.1007/978-981-96-0821-8_1

Zabari N, Azulay A, Gorkor A, Halperin T, Fried O. 2023. Diffusing Colors: Image Colorization with Text Guided Diffusion. SIGGRAPH Asia 2023 Conference Papers, 1-11. Online publication date: 10-Dec-2023.
https://dl.acm.org/doi/10.1145/3610548.3618180

Tang T, Wu Y, Xia P, Wu W, Wang X, Wu Y. 2023. PColorizor: Re-coloring Ancient Chinese Paintings with Ideorealm-congruent Poems. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1-15. Online publication date: 29-Oct-2023.
https://dl.acm.org/doi/10.1145/3586183.3606814

He L, Huang Z, Chen E, Liu Q, Tong S, Wang H, Lian D, Wang S. 2023. An Efficient and Robust Semantic Hashing Framework for Similar Text Search. ACM Transactions on Information Systems 41, 4, 1-31. Online publication date: 30-Jan-2023.
https://dl.acm.org/doi/10.1145/3570725

Li M, Wang Y, Zhang P, Wang H, Fan L, Li E, Wang W. 2023. Deep Learning for Approximate Nearest Neighbour Search: A Survey and Future Directions. IEEE Transactions on Knowledge and Data Engineering 35, 9, 8997-9018. Online publication date: 1-Sep-2023.
https://dl.acm.org/doi/10.1109/TKDE.2022.3220683

Zheng B, Yue Z, Hu Q, Yi X, Luan X, Xie C, Zhou X, Jensen C. 2023. Learned Probing Cardinality Estimation for High-Dimensional Approximate NN Search. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 3209-3221. Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00246

Li B, Lu Y, Pang W, Xu H. 2023. Image Colorization using CycleGAN with semantic and spatial rationality. Multimedia Tools and Applications 82, 14, 21641-21655. Online publication date: 1-Jun-2023.
https://dl.acm.org/doi/10.1007/s11042-023-14675-9

Zhang Z, Patra B, Yaseen A, Zhu J, Sabharwal R, Roberts K, Cao T, Wu H. 2023. Scholarly recommendation systems: a literature survey. Knowledge and Information Systems 65, 11, 4433-4478. Online publication date: 1-Nov-2023.
https://dl.acm.org/doi/10.1007/s10115-023-01901-x

Tang L, Liu Z, Zhang R, Duan Z, Liang Y. 2022. Who Will Travel With Me? Personalized Ranking Using Attributed Network Embedding for Pooling. IEEE Transactions on Intelligent Transportation Systems 23, 8, 12311-12327. Online publication date: Aug-2022.
https://doi.org/10.1109/TITS.2021.3113661
