A theory of subtree matching and tree kernels based on the edit distance concept

Annals of Mathematics and Artificial Intelligence

Abstract

Edit distances are an established way to capture structural features of data: the distance between two data objects measures their dissimilarity. Kernels, in contrast, are similarity functions, and a positive definite kernel gives access to the many techniques of multivariate analysis. This paper aims to bridge the gap between distances and kernels. The literature offers several formulas that convert a negative definite distance function into a positive definite kernel. Edit distance functions, however, are not necessarily negative definite. Our first contribution is an alternative method that derives positive definite kernels from edit distance functions without requiring negative definiteness. The method comes with a strong, easy-to-check sufficient condition for positive definiteness, and this condition turns out to be closely related to the triangle inequality. In fact, to our knowledge, every edit distance function in the literature that satisfies the triangle inequality also meets the condition for positive definiteness. Second, we apply this method to four well-known edit distance functions for trees to obtain four novel kernels, and we show that three of them are positive definite. Third, we develop a theory of subtree matching to study these kernels. Our kernels count matchings between subtrees of the input trees, with a weight determined by each individual matching. Although the number of such matchings grows exponentially with the size of the input trees (the number of vertices), our theory yields dynamic-programming-based algorithms whose asymptotic computational complexities fall between quadratic and cubic in that size.
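
To make the distance-to-kernel conversion concrete, the sketch below applies one classical formula of the kind the abstract mentions, k(x, y) = exp(-gamma * d(x, y)), which is guaranteed positive definite for every gamma > 0 exactly when d is negative definite. This is not the method introduced in the paper. As a simplifying assumption, unit-cost string edit distance stands in for a tree edit distance, and the names edit_distance, gram_matrix, and cholesky_succeeds are hypothetical helpers chosen for illustration.

```python
import math


def edit_distance(a, b):
    """Unit-cost string edit distance (Wagner-Fischer dynamic programming)."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # cost of deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def gram_matrix(items, gamma=1.0):
    """Gram matrix of the classical conversion k(x, y) = exp(-gamma * d(x, y))."""
    return [[math.exp(-gamma * edit_distance(x, y)) for y in items]
            for x in items]


def cholesky_succeeds(matrix, tol=1e-12):
    """Return True if a Cholesky factorization of a symmetric matrix succeeds.

    Success means the matrix is (numerically) strictly positive definite.
    A failure on some sample shows that the conversion is not a valid
    kernel for the chosen distance and gamma.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= tol:                # non-positive pivot: not positive definite
                    return False
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return True
```

A Gram matrix can be checked on any sample, for example `cholesky_succeeds(gram_matrix(["kernel", "kennel", "colonel", "tree"], gamma=0.5))`. A successful check on one sample proves nothing in general. Because edit distances are not necessarily negative definite, this conversion carries no guarantee, which is the gap the paper sets out to address.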

Author information

Corresponding author

Correspondence to Kilho Shin.

Cite this article

Shin, K. A theory of subtree matching and tree kernels based on the edit distance concept. Ann Math Artif Intell 75, 419–460 (2015). https://doi.org/10.1007/s10472-015-9467-5

  • DOI: https://doi.org/10.1007/s10472-015-9467-5
