XML Data Integration Based on Content and Structure Similarity Using Keys

Viyanon, Waraporn; Madria, Sanjay K.; Bhowmick, Sourav S.

doi:10.1007/978-3-540-88871-0_35

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5331)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

This paper proposes a technique for approximately matching XML data based on both content and structure. It detects similar subtrees, which are clustered semantically using leaf-node parents. Each leaf-node parent is treated as the root of a subtree, which is then recursively traversed bottom-up for matching. First, we take advantage of the “key” for matching subtrees, which dramatically reduces the number of comparisons. Second, we measure the similarity degree based on the data and structure of the two XML documents. The results show that our approach finds much more accurate matches, whether or not keys are present in the XML subtrees. Other approaches have difficulty with similarity matching thresholds because they either ignore the available semantic information or cannot handle complex XML data.
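The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the similarity measure, so Jaccard overlap of (tag, text) pairs stands in for the "similarity degree", and the recursive bottom-up traversal is omitted. The function names, the `key_tag` parameter, the 0.5 threshold, and the sample documents are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, invented for this illustration.
BASE = """<db>
  <article><title>XML Keys</title><year>2002</year><pages>473</pages></article>
  <article><title>Tree Edit</title><year>2005</year></article>
</db>"""
TARGET = """<lib>
  <paper><title>xml keys</title><year>2002</year><venue>CN</venue></paper>
  <paper><title>Other</title><year>2005</year></paper>
</lib>"""


def leaf_parent_subtrees(root):
    """Return every element whose children are all leaf elements."""
    return [
        node for node in root.iter()
        if len(node) > 0 and all(len(child) == 0 for child in node)
    ]


def leaf_pairs(subtree):
    """Describe a subtree by its (tag, normalized text) leaf pairs."""
    return {(c.tag, (c.text or "").strip().lower()) for c in subtree}


def similarity(a, b):
    """Assumed stand-in measure: Jaccard overlap of leaf pairs."""
    pa, pb = leaf_pairs(a), leaf_pairs(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def key_value(subtree, key_tag):
    """Normalized text of the key leaf, or None if the subtree has no key."""
    child = subtree.find(key_tag)
    if child is None or child.text is None:
        return None
    return child.text.strip().lower()


def match_subtrees(base_root, target_root, key_tag, threshold=0.5):
    """Compare subtrees that share a key; without a key, compare against all."""
    targets = leaf_parent_subtrees(target_root)
    by_key = {}
    for t in targets:
        k = key_value(t, key_tag)
        if k is not None:
            by_key.setdefault(k, []).append(t)

    matches = []
    for b in leaf_parent_subtrees(base_root):
        k = key_value(b, key_tag)
        # A key narrows the candidates; a missing key forces all-pairs comparison.
        candidates = by_key.get(k, []) if k is not None else targets
        for t in candidates:
            score = similarity(b, t)
            if score >= threshold:
                matches.append((b, t, score))
    return matches


if __name__ == "__main__":
    for b, t, score in match_subtrees(ET.fromstring(BASE), ET.fromstring(TARGET), "title"):
        print(b.tag, "matches", t.tag, "with similarity", score)
```

Indexing the target subtrees by key value is what cuts down the number of comparisons in this sketch: a base subtree with a key is scored only against target subtrees carrying the same key, while a subtree without one falls back to comparison against every target subtree.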

Author information

Authors and Affiliations

Department of Computer Science, Missouri University of Science and Technology, Rolla, Missouri, USA
Waraporn Viyanon & Sanjay K. Madria
School of Computer Engineering, Nanyang Technological University, Singapore
Sourav S. Bhowmick

Editor information

Editors and Affiliations

STARLab, Bldg G/10, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, Bld 10.10, RMIT University, 376-392 Swanston Street, VIC 3001, Melbourne, Australia
Zahir Tari

About this paper

Cite this paper

Viyanon, W., Madria, S.K., Bhowmick, S.S. (2008). XML Data Integration Based on Content and Structure Similarity Using Keys. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems: OTM 2008. OTM 2008. Lecture Notes in Computer Science, vol 5331. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88871-0_35

DOI: https://doi.org/10.1007/978-3-540-88871-0_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88870-3
Online ISBN: 978-3-540-88871-0
eBook Packages: Computer Science, Computer Science (R0)
