Extend tree edit distance for effective object identification

Wang, Yue; Wang, Hongzhi; Zhang, Liyan; Wang, Yang; Li, Jianzhong; Gao, Hong

doi:10.1007/s10115-014-0816-1

Regular Paper
Published: 15 January 2015

Knowledge and Information Systems, Volume 46, pages 629–656 (2016)

Abstract

Similarity join on XML documents, which are usually modeled as rooted ordered labeled trees, is widely applied because references to real-world objects are often ambiguous. The conventional approach to this problem is based on tree edit distance, which lacks flexibility and efficiency. In this paper, we propose two novel edit operations together with an extended tree edit distance, which achieves good performance in similarity matching over hierarchical data structures (the run-time is \(O(n^{3})\) in the worst case). We then propose the \(k\)-generation set distance as a good approximation of the tree edit distance to further improve join efficiency, with quadratic time complexity. Experiments on real and synthetic databases demonstrate the benefit of our method in efficiency and scalability.
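
For orientation, the conventional baseline that the abstract refers to can be sketched as follows. This is only an illustration of the classic unit-cost tree edit distance (insert, delete, relabel) on rooted ordered labeled trees. It is not the extended distance or the \(k\)-generation set distance proposed in the paper, whose definitions are not given in the abstract. The tuple-based tree representation and the names `size`, `forest_distance`, and `tree_edit_distance` are assumptions made for this sketch.

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), where children is a
# tuple of trees; a forest is a tuple of trees. Tuples are hashable, so
# forests can be used directly as memoization keys.


def size(forest):
    """Number of nodes in an ordered forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests.

    Uses the standard recursion on the rightmost root of each forest.
    """
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    v_label, v_children = f[-1]
    w_label, w_children = g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g (symmetric case).
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    relabel = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (0 if v_label == w_label else 1)
    )
    return min(delete, insert, relabel)


def tree_edit_distance(t1, t2):
    """Edit distance between two rooted ordered labeled trees."""
    return forest_distance((t1,), (t2,))


# Two small hypothetical XML-like trees that differ in one leaf label.
t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()), ("d", ())))
print(tree_edit_distance(t1, t2))
```

This memoized recursion is meant for small trees only. It does not implement the decomposition strategies that give the cubic bounds discussed in the tree edit distance literature.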

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099 and National Sci-Tech Support Plan 2015BAH10F00.

Author information

Authors and Affiliations

The School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Yue Wang, Hongzhi Wang, Yang Wang, Jianzhong Li & Hong Gao
Donald Bren School of Information and Computer Sciences, University of California, Irvine, CA, USA
Liyan Zhang

Corresponding author

Correspondence to Hongzhi Wang.

About this article

Cite this article

Wang, Y., Wang, H., Zhang, L. et al. Extend tree edit distance for effective object identification. Knowl Inf Syst 46, 629–656 (2016). https://doi.org/10.1007/s10115-014-0816-1

Received: 04 April 2014
Revised: 16 October 2014
Accepted: 25 December 2014
Published: 15 January 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s10115-014-0816-1
