XML Data Integration Based on Content and Structure Similarity Using Keys

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2008 (OTM 2008)

Abstract

This paper proposes a technique for approximately matching XML data based on both content and structure. It detects similar subtrees that are clustered semantically using leaf-node parents: each leaf-node parent is treated as the root of a subtree, which is then traversed recursively bottom-up for matching. First, we take advantage of keys when matching subtrees, which dramatically reduces the number of comparisons. Second, we measure the degree of similarity based on the data and structure of the two XML documents. The results show that our approach finds much more accurate matches whether or not keys are present in the XML subtrees. Other approaches have difficulty with similarity matching thresholds because they either ignore the available semantic information or cannot handle complex XML data well.
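
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the `key_tag` parameter, the 0.8 threshold, the equal weighting of structure and content, and the sample documents are all assumptions made for illustration, and the recursive bottom-up traversal is reduced to a single level of leaf-node parents.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def leaf_parents(root):
    """Elements with at least one childless child; each roots a candidate subtree."""
    return [e for e in root.iter() if any(len(c) == 0 for c in e)]


def leaves(subtree):
    """Map leaf tag -> text for direct leaf children (repeated tags keep the last value)."""
    return {c.tag: (c.text or "").strip() for c in subtree if len(c) == 0}


def similarity(a, b):
    """Assumed measure: mean of tag overlap (structure) and text similarity (content)."""
    la, lb = leaves(a), leaves(b)
    union = set(la) | set(lb)
    shared = set(la) & set(lb)
    if not union:
        return 0.0
    structure = len(shared) / len(union)
    if not shared:
        return structure / 2
    content = sum(SequenceMatcher(None, la[t], lb[t]).ratio() for t in shared) / len(shared)
    return (structure + content) / 2


def match_subtrees(doc_a, doc_b, key_tag, threshold=0.8):
    """Pair subtrees across two documents; equal key values short-circuit the comparison."""
    candidates_b = leaf_parents(doc_b)
    # Index subtrees of doc_b by key value so keyed subtrees need no pairwise scan.
    by_key = {}
    for sb in candidates_b:
        k = leaves(sb).get(key_tag)
        if k:
            by_key.setdefault(k, sb)
    matches = []
    for sa in leaf_parents(doc_a):
        k = leaves(sa).get(key_tag)
        if k and k in by_key:
            matches.append((sa, by_key[k], 1.0))
            continue
        # No usable key: fall back to scoring every candidate subtree.
        scored = [(similarity(sa, sb), sb) for sb in candidates_b]
        if scored:
            score, best = max(scored, key=lambda pair: pair[0])
            if score >= threshold:
                matches.append((sa, best, score))
    return matches


# Hypothetical sample documents; "id" plays the role of the key.
doc_a = ET.fromstring(
    "<db><article><id>a1</id><title>Sample Title One</title></article>"
    "<article><title>Sample Title Two</title><year>2005</year></article></db>"
)
doc_b = ET.fromstring(
    "<lib><paper><id>a1</id><title>Sample Title One.</title></paper>"
    "<paper><title>Sample Title Two</title><year>2005</year></paper></lib>"
)
matches = match_subtrees(doc_a, doc_b, key_tag="id")
```

In this sketch the first pair is matched through the shared key value without any similarity computation, while the second pair has no key and is matched by the fallback score.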

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Viyanon, W., Madria, S.K., Bhowmick, S.S. (2008). XML Data Integration Based on Content and Structure Similarity Using Keys. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems: OTM 2008. OTM 2008. Lecture Notes in Computer Science, vol 5331. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88871-0_35

  • DOI: https://doi.org/10.1007/978-3-540-88871-0_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88870-3

  • Online ISBN: 978-3-540-88871-0

  • eBook Packages: Computer Science (R0)
