Min-Hashing for Probabilistic Frequent Subtree Feature Spaces

Welke, Pascal; Horváth, Tamás; Wrobel, Stefan

doi:10.1007/978-3-319-46307-0_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9956)

Included in the following conference series:

International Conference on Discovery Science

Abstract

We propose a fast algorithm for approximating graph similarities. For its advantageous semantic and algorithmic properties, we define the similarity between two graphs by the Jaccard-similarity of their images in a binary feature space spanned by the set of frequent subtrees generated for some training dataset. Since the feature space embedding is computationally intractable, we use a probabilistic subtree isomorphism operator based on a small sample of random spanning trees and approximate the Jaccard-similarity by min-hash sketches. The partial order on the feature set defined by subgraph isomorphism allows for a fast calculation of the min-hash sketch, without explicitly performing the feature space embedding. Experimental results on real-world graph datasets show that our technique results in a fast algorithm. Furthermore, the approximated similarities are well-suited for classification and retrieval tasks in large graph datasets.
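
The min-hash idea mentioned in the abstract can be illustrated with a minimal sketch. It relies on the standard property that, for a uniformly random permutation of the feature set, two sets share the same minimum element with probability equal to their Jaccard-similarity. The code below is not the authors' implementation: features are plain integers standing in for frequent subtree patterns, the sets are assumed non-empty, and all function names (`random_rankings`, `minhash_sketch`, `estimate_jaccard`) are hypothetical.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def random_rankings(universe, k, seed=0):
    """Draw k random permutations of the universe.

    Each permutation is stored as a dict mapping a feature to its rank.
    """
    rng = random.Random(seed)
    items = sorted(universe)
    rankings = []
    for _ in range(k):
        shuffled = items[:]
        rng.shuffle(shuffled)
        rankings.append({f: r for r, f in enumerate(shuffled)})
    return rankings


def minhash_sketch(features, rankings):
    """For each permutation, keep the feature of smallest rank.

    Assumes `features` is a non-empty subset of the universe.
    """
    return [min(features, key=rank.__getitem__) for rank in rankings]


def estimate_jaccard(s1, s2):
    """Fraction of sketch positions on which the two sketches agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


# Illustrative data only: integers play the role of pattern identifiers.
universe = set(range(100))
rankings = random_rankings(universe, k=200, seed=42)
g1 = set(range(0, 60))
g2 = set(range(30, 90))
print("exact:   ", jaccard(g1, g2))
print("estimate:", estimate_jaccard(minhash_sketch(g1, rankings),
                                    minhash_sketch(g2, rankings)))
```

A larger number of permutations reduces the variance of the estimate at the cost of longer sketches. Note that this sketch computes the full feature set first, which is exactly the step the paper aims to avoid.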

Notes

1. We note that the crucial property implying the negative complexity result in [6] is not necessarily the intractability of subgraph isomorphism; there are cases when efficient frequent subgraph mining is possible even for NP-hard pattern matching operators [7].
2. In practice, we do not store the patterns in \(\textsc{Sketch}_{\pi_1,\ldots,\pi_K}(G)\) explicitly. Instead, we define some arbitrary total order on \(\mathcal{F}\) and represent each pattern by its position according to this order.
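
The compact representation described in the second note might look as follows. This is a sketch under stated assumptions, not the authors' code: patterns are represented by hypothetical canonical strings, the total order is simply the lexicographic order of those strings, and the helper names `index_patterns` and `compact_sketch` are made up for illustration.

```python
def index_patterns(patterns):
    """Fix an arbitrary total order on the pattern set.

    Here the order is the sorted order of the (assumed) canonical strings;
    any other fixed order would serve equally well.
    """
    return {p: i for i, p in enumerate(sorted(patterns))}


def compact_sketch(sketch, position):
    """Replace each pattern in a sketch by its position in the total order."""
    return [position[p] for p in sketch]


# Illustrative pattern set; the strings are placeholders, not real output
# of a frequent subtree miner.
position = index_patterns({"C-C", "C-N", "C-O"})
print(compact_sketch(["C-O", "C-C", "C-O"], position))
```

Two sketches stored this way can be compared position by position using integer equality alone, so the patterns themselves never need to be kept in memory alongside the sketches.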

References

1. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of Sequences, pp. 21–29. IEEE (1997)
2. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)
3. Deshpande, M., Kuramochi, M., Wale, N., Karypis, G.: Frequent substructure-based approaches for classifying chemical compounds. Trans. Knowl. Data Eng. 17(8), 1036–1050 (2005)
4. Diestel, R.: Graph Theory. Graduate Texts in Mathematics, vol. 173, 4th edn. Springer, Heidelberg (2012). http://dblp.dagstuhl.de/rec/bib/books/daglib/0030488
5. Geppert, H., Horváth, T., Gärtner, T., Wrobel, S., Bajorath, J.: Support-vector-machine-based ranking significantly improves the effectiveness of similarity searching using 2D fingerprints and multiple reference compounds. J. Chem. Inf. Model. 48(4), 742–746 (2008)
6. Horváth, T., Bringmann, B., Raedt, L.: Frequent hypergraph mining. In: Inoue, K., Ohwada, H., Yamamoto, A. (eds.) ILP 2006. LNCS (LNAI), vol. 4455, pp. 244–259. Springer, Heidelberg (2007). doi:10.1007/978-3-540-73847-3_26
7. Horváth, T., Ramon, J.: Efficient frequent connected subgraph mining in graphs of bounded tree-width. Theor. Comput. Sci. 411(31–33), 2784–2797 (2010)
8. Ralaivola, L., Swamidass, S.J., Saigo, H., Baldi, P.: Graph kernels for chemical informatics. Neural Netw. 18(8), 1093–1110 (2005)
9. Shamir, R., Tsur, D.: Faster subtree isomorphism. J. Algorithms 33(2), 267–280 (1999). doi:10.1006/jagm.1999.1044
10. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A.J., Vishwanathan, S.V.N.: Hash kernels for structured data. J. Mach. Learn. Res. 10, 2615–2637 (2009)
11. Teixeira, C.H.C., Silva, A., Meira Jr., W.: Min-hash fingerprints for graph kernels: a trade-off among accuracy, efficiency, and compression. J. Inf. Data Manag. 3(3), 227–242 (2012)
12. Welke, P., Horváth, T., Wrobel, S.: Probabilistic frequent subtree kernels. In: Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z.W. (eds.) NFMCP 2015. LNCS (LNAI), vol. 9607, pp. 179–193. Springer, Heidelberg (2016). doi:10.1007/978-3-319-39315-5_12

Author information

Authors and Affiliations

Department of Computer Science, University of Bonn, Bonn, Germany
Pascal Welke, Tamás Horváth & Stefan Wrobel
Fraunhofer IAIS, Schloss Birlinghoven, Sankt Augustin, Germany
Tamás Horváth & Stefan Wrobel

Corresponding author

Correspondence to Pascal Welke.

Editor information

Editors and Affiliations

Universiteit Antwerpen, Campus Middelhe, M.G.103a, Antwerp, Belgium
Toon Calders
Università degli Studi di Bari Aldo Moro, Bari, Italy
Michelangelo Ceci
Bari, Italy
Donato Malerba

About this paper

Cite this paper

Welke, P., Horváth, T., Wrobel, S. (2016). Min-Hashing for Probabilistic Frequent Subtree Feature Spaces. In: Calders, T., Ceci, M., Malerba, D. (eds) Discovery Science. DS 2016. Lecture Notes in Computer Science (LNAI), vol 9956. Springer, Cham. https://doi.org/10.1007/978-3-319-46307-0_5

DOI: https://doi.org/10.1007/978-3-319-46307-0_5
Published: 21 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46306-3
Online ISBN: 978-3-319-46307-0
eBook Packages: Computer Science, Computer Science (R0)
