
HOPS: Probabilistic Subtree Mining for Small and Large Graphs

Authors:

Pascal Welke,

Florian Seiffarth,

Michael Kamp,

Stefan Wrobel

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 1275 - 1284

https://doi.org/10.1145/3394486.3403180

Published: 20 August 2020

Abstract

Frequent subgraph mining, i.e., the identification of relevant patterns in graph databases, is a well-known data mining problem with high practical relevance, since next to summarizing the data, the resulting patterns can also be used to define powerful domain-specific similarity functions for prediction. In recent years, significant progress has been made towards subgraph mining algorithms that scale to complex graphs by focusing on tree patterns and probabilistically allowing a small amount of incompleteness in the result. Nonetheless, the complexity of the pattern matching component used for deciding subtree isomorphism on arbitrary graphs has significantly limited the scalability of existing approaches. In this paper, we adapt sampling techniques from mathematical combinatorics to the problem of probabilistic subtree mining in arbitrary databases of many small to medium-size graphs or a single large graph. By restricting to tree patterns, we provide an algorithm that approximately counts or decides subtree isomorphism for arbitrary transaction graphs in sub-linear time with one-sided error. Our empirical evaluation on a range of benchmark graph datasets shows that the novel algorithm substantially outperforms state-of-the-art approaches both in the task of approximate counting of embeddings in single large graphs and in probabilistic frequent subtree mining in large databases of small to medium-size graphs.

Supplementary Material

MP4 File (3394486.3403180.mp4)

The HOPS embedding algorithm uses importance sampling to estimate the number of embeddings of a tree pattern in a very large graph, or to estimate the frequency of a pattern in arbitrary databases of many small to medium-size graphs. The algorithm outperforms state-of-the-art estimation algorithms in terms of accuracy, and it is very fast: its runtime is independent of the target graph size, which makes it possible to estimate the number of trees for graphs and patterns that are too large for state-of-the-art methods. It also finds frequent subtrees orders of magnitude faster than the state of the art.
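To make the idea of importance sampling for embedding counts concrete, here is a minimal Python sketch. It is not the HOPS algorithm from the paper and makes simplifying assumptions: the pattern is an unlabeled rooted tree, the target graph is an undirected adjacency dictionary, and an "embedding" is counted as an injective, edge-preserving map (so automorphic copies are counted separately). The names `trial_weight` and `estimate_embeddings` are hypothetical. Each trial grows one random embedding vertex by vertex and returns the inverse of the probability with which that embedding was drawn; averaging these weights gives an unbiased estimate of the number of such maps.

```python
import random


def trial_weight(graph, vertices, children, root, rng):
    """Run one sampling trial.

    graph:    dict mapping each target vertex to a set of its neighbours
    vertices: list of all target vertices (precomputed once)
    children: dict mapping each pattern vertex to a list of its children
    root:     root vertex of the tree pattern
    Returns 1 / p(embedding) for the sampled embedding, or 0.0 on a dead end.
    """
    image = {root: rng.choice(vertices)}   # root image is uniform over V(G)
    used = {image[root]}
    weight = float(len(vertices))          # inverse probability so far
    stack = [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            # Unused neighbours of the image of the parent are the only
            # places where the child can go while keeping the map injective.
            candidates = [u for u in graph[image[v]] if u not in used]
            if not candidates:
                return 0.0                 # dead end: contributes zero
            weight *= len(candidates)
            image[c] = rng.choice(candidates)
            used.add(image[c])
            stack.append(c)
    return weight


def estimate_embeddings(graph, children, root, trials=1000, seed=0):
    """Average the trial weights; the mean is an unbiased estimate."""
    rng = random.Random(seed)
    vertices = list(graph)
    total = sum(trial_weight(graph, vertices, children, root, rng)
                for _ in range(trials))
    return total / trials


# Usage: count maps of a 3-vertex path into a triangle.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path3 = {"a": ["b"], "b": ["c"]}
print(estimate_embeddings(triangle, path3, "a", trials=100))
```

In this sketch the work per trial depends on the pattern size and on the degrees of the visited target vertices, not on the total number of target vertices, which is the property the description above attributes to HOPS. Handling vertex or edge labels, and dividing by the number of pattern automorphisms to count subgraph copies instead of maps, are left out.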


