Abstract
Edit distances are an established way to capture structural features of data: the distance between two data objects measures their dissimilarity. Kernels, in contrast, are similarity functions, and a positive definite kernel lets us draw on the many techniques of multivariate analysis. This paper aims to fill the gap between distances and kernels. The literature offers several formulas that convert a negative definite distance function into a positive definite kernel. Edit distance functions, however, are not necessarily negative definite. Our first contribution is therefore an alternative method that derives positive definite kernels from edit distance functions without requiring negative definiteness. The method comes with a strong, easy-to-check sufficient condition for positive definiteness, and this condition turns out to be closely related to the triangle inequality. In fact, to our knowledge, every edit distance function in the literature that satisfies the triangle inequality also meets the condition for positive definiteness. Secondly, we apply this method to four well-known edit distance functions for trees to introduce four novel kernels, and we show that three of them are positive definite. Thirdly, we develop a theory of subtree matching to study these kernels. Our kernels count matchings between subtrees of the input trees, with each matching weighted individually. Although the number of such matchings is exponential in the size of the input trees (the number of vertices), our theory yields dynamic-programming-based algorithms whose asymptotic computational complexities range from quadratic to cubic in the size.
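As a rough illustration of the conversion formulas from the literature that the abstract mentions (not the method introduced in this paper), one standard choice is k(x, y) = exp(-γ d(x, y)), which is positive definite for every γ > 0 exactly when d is negative definite. The sketch below applies it to the unit-cost string edit distance, used here only as a simple stand-in for a tree edit distance. The function names and the default value of `gamma` are assumptions made for the example. Because edit distances are not necessarily negative definite, the resulting Gram matrix is not guaranteed to be positive definite, which is the gap the paper addresses.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Unit-cost string edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def distance_kernel(a: str, b: str, gamma: float = 0.5) -> float:
    """Hypothetical helper: k(a, b) = exp(-gamma * d(a, b)).

    Positive definiteness is NOT guaranteed for an arbitrary edit distance.
    """
    return math.exp(-gamma * edit_distance(a, b))


def gram_matrix(items, gamma: float = 0.5):
    """Kernel values for all pairs of items, as a list of rows."""
    return [[distance_kernel(x, y, gamma) for y in items] for x in items]
```

A Gram matrix built this way is symmetric with ones on the diagonal; whether it is positive semidefinite has to be checked separately, for example by inspecting its eigenvalues.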
Cite this article
Shin, K. A theory of subtree matching and tree kernels based on the edit distance concept. Ann Math Artif Intell 75, 419–460 (2015). https://doi.org/10.1007/s10472-015-9467-5
DOI: https://doi.org/10.1007/s10472-015-9467-5