DOI: 10.1145/3394486.3403180
research-article

HOPS: Probabilistic Subtree Mining for Small and Large Graphs

Published: 20 August 2020

Abstract

Frequent subgraph mining, i.e., the identification of relevant patterns in graph databases, is a well-known data mining problem with high practical relevance, since, besides summarizing the data, the resulting patterns can also be used to define powerful domain-specific similarity functions for prediction. In recent years, significant progress has been made towards subgraph mining algorithms that scale to complex graphs by focusing on tree patterns and probabilistically allowing a small amount of incompleteness in the result. Nonetheless, the complexity of the pattern matching component used for deciding subtree isomorphism on arbitrary graphs has significantly limited the scalability of existing approaches. In this paper, we adapt sampling techniques from mathematical combinatorics to the problem of probabilistic subtree mining in arbitrary databases of many small to medium-size graphs or a single large graph. By restricting the patterns to trees, we provide an algorithm that approximately counts or decides subtree isomorphism for arbitrary transaction graphs in sub-linear time with one-sided error. Our empirical evaluation on a range of benchmark graph datasets shows that the novel algorithm substantially outperforms state-of-the-art approaches both in the task of approximate counting of embeddings in single large graphs and in probabilistic frequent subtree mining in large databases of small to medium-size graphs.

Supplementary Material

MP4 File (3394486.3403180.mp4)
The HOPS embedding algorithm uses importance sampling to estimate the number of embeddings of a tree pattern in a very large graph, or the frequency of a pattern in arbitrary databases of many small to medium-size graphs. The algorithm outperforms state-of-the-art estimation algorithms in terms of accuracy, and it is very fast: its runtime is independent of the target graph size, which makes it possible to estimate the number of trees for graphs and patterns too large for state-of-the-art methods, and it finds frequent subtrees orders of magnitude faster than the state of the art.
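
As a rough illustration of the importance-sampling idea described above, the following Python sketch estimates the number of embeddings (injective homomorphisms) of a rooted tree pattern in a graph. It is a simplified sequential sampler, not the HOPS algorithm itself: the function names, the adjacency-dict input format, and the vertex-by-vertex sampling order are all assumptions made for illustration. Each trial maps the pattern root to a uniformly random graph vertex, extends the mapping one child at a time to a random unused neighbor of the parent's image, and multiplies together the number of choices available at each step. The mean of these products over many trials is an unbiased estimate of the embedding count.

```python
import random


def sample_embedding_weight(pattern_children, graph_adj, rng):
    """Run one sampling trial (illustrative helper, not from the paper).

    pattern_children: dict mapping a pattern vertex to its list of children;
                      the pattern root is assumed to be vertex 0.
    graph_adj:        dict mapping each graph vertex to its neighbors.
    Returns the product of the choice counts, or 0 if the trial gets stuck.
    """
    vertices = list(graph_adj)
    root_image = rng.choice(vertices)
    weight = len(vertices)          # the root had len(vertices) possible images
    image = {0: root_image}
    used = {root_image}             # keeps the mapping injective
    stack = [0]
    while stack:
        parent = stack.pop()
        for child in pattern_children.get(parent, []):
            # Candidate images: unused neighbors of the parent's image.
            candidates = [v for v in graph_adj[image[parent]] if v not in used]
            if not candidates:
                return 0            # dead end: this trial contributes zero
            choice = rng.choice(candidates)
            weight *= len(candidates)
            image[child] = choice
            used.add(choice)
            stack.append(child)
    return weight


def estimate_embeddings(pattern_children, graph_adj, trials, seed=0):
    """Average the trial weights to estimate the number of embeddings."""
    rng = random.Random(seed)
    total = sum(
        sample_embedding_weight(pattern_children, graph_adj, rng)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    path_pattern = {0: [1], 1: [2]}  # path on three vertices, rooted at one end
    print(estimate_embeddings(path_pattern, triangle, trials=100))
```

In this sketch a positive trial value certifies that an embedding exists, whereas a run of zero-valued trials does not prove that none exists. That asymmetry is one way to read the one-sided error mentioned in the abstract.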




      Information & Contributors

      Information

      Published In

      KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
      August 2020
      3664 pages
      ISBN:9781450379984
      DOI:10.1145/3394486
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • Research-article

      Conference

      KDD '20

      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 28
      • Downloads (Last 6 weeks): 3
      Reflects downloads up to 14 Feb 2025

      Citations

      Cited By

      • (2023) Mining domain-specific edit operations from model repositories with applications to semantic lifting of model differences and change profiling. Automated Software Engineering 30:2. DOI: 10.1007/s10515-023-00381-1. Online publication date: 26-Apr-2023
      • (2022) Parallel Frequent Subtrees Mining Method by an Effective Edge Division Strategy. Applied Sciences 12:9 (4778). DOI: 10.3390/app12094778. Online publication date: 9-May-2022
      • (2021) Learning domain-specific edit operations from model repositories with frequent subgraph mining. Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (930-942). DOI: 10.1109/ASE51524.2021.9678698. Online publication date: 15-Nov-2021
      • (2020) Efficient Frequent Subgraph Mining in Transactional Databases. 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) (307-314). DOI: 10.1109/DSAA49011.2020.00044. Online publication date: Oct-2020
