Structure and Content Similarity for Clustering XML Documents

  • Conference paper
Web-Age Information Management (WAIM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6185)

Abstract

XML has been used extensively in many information-retrieval applications. Clustering, an important data mining technique, has been applied to the analysis of XML data. The key issue in XML clustering is how to measure the similarity between XML documents. Traditional document clustering methods measure document similarity using content information alone and ignore the structural information contained in XML documents. In this paper, we propose a model called the Structure and Content Vector Model (SCVM) to represent both the structure and the content information in XML documents. Based on this model, we define a similarity measure that can be used to cluster XML documents. Our experimental results show that the proposed model and similarity measure are effective in identifying similar documents when the structural information contained in the XML documents is meaningful. The method can be used to improve precision and efficiency in XML information retrieval.
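
The abstract does not give the definition of SCVM or of its similarity measure, so the following is only a minimal sketch of the general idea of combining structure and content in one vector. It assumes, purely for illustration, that each feature is a pair of an element path and a term found directly under that path, and that similarity is plain cosine similarity over the resulting counts. The function names `path_term_features` and `cosine_similarity` are hypothetical and are not taken from the paper.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_term_features(xml_text):
    """Count (element path, term) pairs.

    Illustrative assumption only: this is not the paper's SCVM definition.
    """
    features = Counter()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        # Use only the text that appears directly inside this element,
        # before its first child; tail text is ignored in this sketch.
        for term in (node.text or "").lower().split():
            features[(path, term)] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return features


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    book = "<book><title>xml clustering</title></book>"
    article = "<article><title>xml clustering</title></article>"
    sim = cosine_similarity(path_term_features(book), path_term_features(article))
    print(sim)
```

In this sketch, two documents with identical text but different element paths share no features, which is one simple way a structure-aware representation can separate documents that a content-only vector would treat as identical. A real system would likely weight terms and match paths more flexibly than exact string equality.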

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, L., Li, Z., Chen, Q., Li, N. (2010). Structure and Content Similarity for Clustering XML Documents. In: Shen, H.T., et al. Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16720-1_12

  • DOI: https://doi.org/10.1007/978-3-642-16720-1_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16719-5

  • Online ISBN: 978-3-642-16720-1

  • eBook Packages: Computer Science, Computer Science (R0)
