Parameterized Mapping Distances for Semi-Structured Data

Shin, Kilho; Niiyama, Taro

doi:10.1007/978-3-030-05453-3_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11352)

Included in the following conference series:

International Conference on Agents and Artificial Intelligence

Abstract

Edit distances have been widely used as an effective means of analyzing the similarity of semi-structured data such as strings, trees and graphs. For example, the Levenshtein distance for strings is known to be effective for analyzing DNA and proteins, and the Taï distance and its variations are attracting wide attention from researchers who study tree-type data such as glycans, HTML DOM trees and parse trees in natural language processing. The problem we address here is that new edit distances have so far been engineered in an ad hoc way, without a unified view. To solve this problem, we introduce the concept of the mapping distance and a hyper-parameter that controls the cost of label mismatch. One of the most important advantages of our parameterized mapping distances is that they can be defined for arbitrary finite sets in a consistent manner, so that important properties, such as whether the axioms of a metric are satisfied, can be discussed abstractly, regardless of the structure of the data. The second important advantage is that mapping distances themselves can be parameterized, and therefore we can identify the best distance for a particular application by parameter search. The mapping distance framework provides a unified view over various distance measures for semi-structured data by focusing on partial one-to-one mappings between data. These partial one-to-one mappings generalize what are known as the mappings of edit paths in the legacy study of edit distances. This is in clear contrast to the legacy edit distance framework, which defines distances through edit operations and edit paths. Our framework enables us to design new distance measures in a consistent manner and to describe various distance measures using a small number of parameters. In fact, in this paper, we take ordered rooted trees as an example and introduce three independent dimensions along which to parameterize mapping distance measures.
Through intensive experiments using ten datasets, we identify two important mapping distances that exhibit good classification performance when used with the k-NN classifier. These mapping distances are novel and have not been discussed in the literature.
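
As a rough illustration of the idea, and not the authors' implementation, the sketch below specializes a mapping-style distance to strings. It assumes a cost model that the abstract does not spell out: every character left out of the partial one-to-one mapping costs 1, every mapped pair with different labels costs a hyper-parameter (here called mismatch_cost, a hypothetical name), and mappings must preserve left-to-right order. Under these assumptions, mismatch_cost = 1 gives the Levenshtein distance. The helper nearest_label is likewise hypothetical and only hints at how such a distance could feed a 1-NN classifier during a parameter search.

```python
from typing import List, Tuple


def string_mapping_distance(x: str, y: str, mismatch_cost: float = 1.0) -> float:
    """Minimum cost over order-preserving partial one-to-one mappings of x to y.

    Assumed cost model (an illustration, not taken from the paper):
    an unmapped character costs 1, a mapped pair with different labels
    costs `mismatch_cost`, and a mapped pair with equal labels costs 0.
    """
    m, n = len(x), len(y)
    # dp[i][j] holds the distance between the prefixes x[:i] and y[:j].
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)  # all of x[:i] is left unmapped
    for j in range(1, n + 1):
        dp[0][j] = float(j)  # all of y[:j] is left unmapped
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = 0.0 if x[i - 1] == y[j - 1] else mismatch_cost
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,       # x[i-1] is unmapped
                dp[i][j - 1] + 1.0,       # y[j-1] is unmapped
                dp[i - 1][j - 1] + pair,  # x[i-1] is mapped to y[j-1]
            )
    return dp[m][n]


def nearest_label(query: str, labeled: List[Tuple[str, str]],
                  mismatch_cost: float) -> str:
    """Return the label of the training string closest to `query` (1-NN)."""
    closest = min(
        labeled,
        key=lambda item: string_mapping_distance(query, item[0], mismatch_cost),
    )
    return closest[1]
```

Varying mismatch_cost changes which mappings are optimal: at values of 2 or more, mapping two differently labeled characters is never cheaper than leaving both unmapped, so the distance depends only on the longest common subsequence.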

Acknowledgement

This work was supported by JSPS KAKENHI Grant Numbers JP17H007623 and JP16K12491.

Author information

Authors and Affiliations

Graduate School of Applied Informatics, University of Hyogo, Kobe, Japan
Kilho Shin
NTT DoCoMo, Tokyo, Japan
Taro Niiyama

Corresponding author

Correspondence to Kilho Shin.

Editor information

Editors and Affiliations

Leiden University, Leiden, The Netherlands
Jaap van den Herik
University of Porto, Porto, Portugal
Ana Paula Rocha

About this paper

Cite this paper

Shin, K., Niiyama, T. (2019). Parameterized Mapping Distances for Semi-Structured Data. In: van den Herik, J., Rocha, A. (eds) Agents and Artificial Intelligence. ICAART 2018. Lecture Notes in Computer Science, vol 11352. Springer, Cham. https://doi.org/10.1007/978-3-030-05453-3_21

DOI: https://doi.org/10.1007/978-3-030-05453-3_21
Published: 30 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05452-6
Online ISBN: 978-3-030-05453-3
eBook Packages: Computer Science, Computer Science (R0)
