Abstract
Similarity join on XML documents, which are usually modeled as rooted ordered labeled trees, is widely applied because references to real-world objects are often ambiguous. The conventional method for this problem is based on tree edit distance, which lacks flexibility and efficiency. In this paper, we propose two novel edit operations together with an extended tree edit distance, which achieves good performance in similarity matching on hierarchical data structures (the run-time is \(O(n^{3})\) in the worst case). We then propose the \(k\)-generation set distance as a good approximation of the tree edit distance, which further improves join efficiency with quadratic time complexity. Experiments on real and synthetic databases demonstrate the benefit of our method in efficiency and scalability.
Acknowledgments
This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099 and National Sci-Tech Support Plan 2015BAH10F00.
Cite this article
Wang, Y., Wang, H., Zhang, L. et al. Extend tree edit distance for effective object identification. Knowl Inf Syst 46, 629–656 (2016). https://doi.org/10.1007/s10115-014-0816-1
DOI: https://doi.org/10.1007/s10115-014-0816-1