Most similar maximal clique query on large graphs

Peng, Yun; Xu, Yitong; Zhao, Huawei; Zhou, Zhizheng; Han, Huimin

doi:10.1007/s11704-019-7235-0

Research Article
Published: 19 December 2019

Volume 14, article number 143601, (2020)

Frontiers of Computer Science

Yun Peng¹,
Yitong Xu²,
Huawei Zhao¹,
Zhizheng Zhou¹ &
Huimin Han³

Abstract

This paper studies the most similar maximal clique query (MSMCQ). Given a graph G and a set of nodes Q, MSMCQ finds the maximal clique of G that has the largest similarity with Q. MSMCQ has many real applications in advertising, public security, task crowdsourcing, and social networks. MSMCQ can be studied as a special case of the general set similarity query (SSQ). However, the maximal cliques (MCs) of G have several properties that distinguish them from general sets. Exploiting these properties, we propose a novel index, namely MCIndex, which significantly outperforms the state-of-the-art SSQ method in terms of the number of candidates and the query time. Specifically, we first construct an inverted index I for all the MCs of G. Since the MCs in a posting list often overlap heavily, MCIndex selects some pivots and clusters the MCs around them within a small radius. Given a query Q, we compute the distance from each pivot to Q. Clusters whose pivots guarantee that they cannot contain the answer are pruned by our distance-based pruning rule. Since it is NP-hard to construct a minimum MCIndex, we propose to construct a minimal MCIndex on a posting list I(v) with an approximation ratio of 1 + ln |I(v)|. Since the MCs have properties that are inherent in the graph structure, we further propose an SIndex within each cluster of an MCIndex, together with a structure-based pruning rule. SIndex can significantly reduce the number of candidates. Since the sizes of the intersections between Q and many MCs need to be computed during query evaluation, we also propose a binary representation of MCs that makes the intersection size computation more efficient. Our extensive experiments on several real-world datasets confirm the effectiveness and efficiency of the proposed techniques.
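
The baseline that the abstract builds on (an inverted index over the MCs plus a binary representation for fast intersection sizes) can be illustrated with a minimal sketch. This is not the authors' MCIndex or SIndex: pivot clustering and both pruning rules are omitted, Jaccard similarity is an assumed measure since the abstract does not name one, and all function names and the toy cliques are hypothetical.

```python
from collections import defaultdict


def to_mask(nodes):
    """Encode a set of small non-negative integer node ids as a bitmask."""
    mask = 0
    for v in nodes:
        mask |= 1 << v
    return mask


def popcount(x):
    """Number of set bits, i.e. the size of the encoded set."""
    return bin(x).count("1")


def jaccard(a, b):
    """Jaccard similarity of two bitmask-encoded sets (assumed measure)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def build_inverted_index(clique_masks):
    """Map each node v to the ids of the maximal cliques that contain v."""
    index = defaultdict(list)
    for cid, mask in enumerate(clique_masks):
        v = 0
        while mask:
            if mask & 1:
                index[v].append(cid)
            mask >>= 1
            v += 1
    return index


def most_similar_clique(query_nodes, clique_masks, index):
    """Scan only cliques sharing a node with Q; others have similarity 0."""
    q = to_mask(query_nodes)
    candidates = {cid for v in query_nodes for cid in index.get(v, [])}
    best_id, best_sim = None, 0.0
    for cid in sorted(candidates):
        sim = jaccard(q, clique_masks[cid])
        if sim > best_sim:
            best_id, best_sim = cid, sim
    return best_id, best_sim


# Hypothetical maximal cliques of a toy graph:
# a triangle 0-1-2, an edge 2-3, and a 4-clique on 3, 4, 5, 6.
cliques = [{0, 1, 2}, {2, 3}, {3, 4, 5, 6}]
masks = [to_mask(c) for c in cliques]
index = build_inverted_index(masks)

best_id, best_sim = most_similar_clique({3, 4, 5}, masks, index)
print(best_id, best_sim)  # clique id 2 with similarity 0.75
```

With the bitmask encoding, each intersection size is a single bitwise AND followed by a bit count, which is the kind of saving the binary representation targets. The sketch still scores every candidate in the relevant posting lists; reducing that candidate set is the role the abstract assigns to MCIndex and SIndex.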

Acknowledgements

This work was supported by the NSF of China (61502258, 61806105), the Major Technology Innovation Project of Shandong (2018CXGC0703), the NSF of Shandong Province (ZR2014FQ007), the Project of Shandong Finance Society (2018SDJR31), and the Soft Science Fund of Shandong Province (2018RKB01373, 2016RKB01043). It was also supported by the Scientific Research Start-up Fund and the Outstanding Young Scholars Program of Social Science at Qilu University of Technology.

Author information

Authors and Affiliations

Research Center of Big Data Application, Qilu University of Technology, Jinan, 250353, China
Yun Peng, Huawei Zhao & Zhizheng Zhou
Data Science Institute, Columbia University, New York, 10027, USA
Yitong Xu
College of Mechanics and Materials, Hohai University, Nanjing, 210098, China
Huimin Han

Corresponding author

Correspondence to Yun Peng.

Additional information

Yun Peng is an associate professor at the Research Center of Big Data Application, Qilu University of Technology, China. He received his PhD degree from Hong Kong Baptist University, China in 2013, and his BSci and MPhil degrees in computer science from Shandong University and the Harbin Institute of Technology (HIT), China in 2006 and 2008, respectively. His research interests include graph-structured data processing, data mining, and machine learning.

Yitong Xu is a data scientist at Amazon. He received his BA in math and statistics from the University of Minnesota, Twin Cities, USA and his MS in data science from the Data Science Institute, Columbia University, USA. His research interests include information retrieval, graph data mining, and recommendation systems.

Huawei Zhao is a professor at the Qilu University of Technology, China. He received his MS degree from the School of Computer Science and his PhD degree from the Institute of Information & Network Security, both at Shandong University, China, in 2002 and 2006, respectively. His research interests include unstructured data processing, information security, and the digital economy.

Zhizheng Zhou is an associate professor of Qilu University of Technology, China. His research fields include financial data analysis and FinTech.

Huimin Han is a PhD student at the College of Mechanics and Materials, Hohai University, China. Her research interests include big data analysis and structural health monitoring.
