
Structural Similarity Evaluation of XML Documents Based on Basic Statistics

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7529)

Abstract

Similarity evaluation between XML documents is the basis of XML structural mining and a crucial factor in the quality of the mining result. After introducing the popular tree edit method and frequent pattern method for XML data mining, this paper uses 10 basic statistics to describe the structural information of XML documents and then applies an improved Euclidean distance to evaluate their similarity. To verify the performance of the proposed evaluation method, it is applied to XML document clustering. The experimental results show that our method is superior to methods based on the tree edit method or the frequent pattern method.
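The general idea of comparing documents through a vector of structural statistics can be illustrated with a minimal sketch. The abstract does not list the 10 statistics or define the improved Euclidean distance, so everything below is an assumption made for illustration: the five statistics, the per-component scaling, and the function names `structural_stats`, `scaled_euclidean`, and `similarity` are hypothetical and are not the authors' method.

```python
import math
import xml.etree.ElementTree as ET


def structural_stats(xml_text):
    """Return a small vector of structural statistics for one XML document.

    The five statistics are illustrative assumptions, not the ten
    statistics used in the paper.
    """
    root = ET.fromstring(xml_text)
    element_count = 0
    leaf_count = 0
    max_depth = 0
    max_fanout = 0
    tags = set()
    stack = [(root, 1)]  # (element, depth); the root is at depth 1
    while stack:
        node, depth = stack.pop()
        children = list(node)
        element_count += 1
        tags.add(node.tag)
        max_depth = max(max_depth, depth)
        max_fanout = max(max_fanout, len(children))
        if not children:
            leaf_count += 1
        stack.extend((child, depth + 1) for child in children)
    return [element_count, leaf_count, max_depth, max_fanout, len(tags)]


def scaled_euclidean(u, v):
    """Euclidean distance after scaling each component difference into [0, 1].

    Each difference is divided by the larger of the two values so that large
    counts (element totals) do not dominate small ones (depth). This scaling
    is an assumption; the paper's improved distance may be defined differently.
    """
    total = 0.0
    for a, b in zip(u, v):
        scale = max(abs(a), abs(b)) or 1.0
        total += ((a - b) / scale) ** 2
    return math.sqrt(total)


def similarity(xml_a, xml_b):
    """Map the distance to a similarity in (0, 1]; equal statistics give 1.0."""
    distance = scaled_euclidean(structural_stats(xml_a), structural_stats(xml_b))
    return 1.0 / (1.0 + distance)


# Three toy documents: doc_a and doc_b share a shallow, bushy layout,
# while doc_c is a single deep chain.
doc_a = "<lib><book><title/><author/></book><book><title/></book></lib>"
doc_b = "<lib><book><title/><author/></book></lib>"
doc_c = "<a><b><c><d><e/></d></c></b></a>"

print(similarity(doc_a, doc_b), similarity(doc_a, doc_c))
```

In this sketch the two library-style documents score as more similar to each other than either does to the deep chain. A pairwise similarity of this kind could then feed any standard clustering procedure, which is the setting the abstract uses for evaluation.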




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, CY., Wu, XJ., Li, J., Ge, Y. (2012). Structural Similarity Evaluation of XML Documents Based on Basic Statistics. In: Wang, F.L., Lei, J., Gong, Z., Luo, X. (eds) Web Information Systems and Mining. WISM 2012. Lecture Notes in Computer Science, vol 7529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33469-6_86


  • DOI: https://doi.org/10.1007/978-3-642-33469-6_86

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33468-9

  • Online ISBN: 978-3-642-33469-6

  • eBook Packages: Computer Science, Computer Science (R0)
