Structure and Content Similarity for Clustering XML Documents

Zhang, Lijun; Li, Zhanhuai; Chen, Qun; Li, Ning

doi:10.1007/978-3-642-16720-1_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6185)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

XML is widely used in applications related to information retrieval. Clustering, an important data mining technique, has been applied to analyze XML data. The key issue in XML clustering is how to measure the similarity between XML documents. Traditional document clustering methods measure similarity using content information alone and ignore the structural information contained in XML documents. In this paper, we propose a model called the Structure and Content Vector Model (SCVM) to represent both the structure and the content of XML documents. Based on this model, we define a similarity measure that can be used to cluster XML documents. Our experimental results show that the proposed model and similarity measure are effective in identifying similar documents when the structural information contained in XML documents is meaningful. This method can be used to improve the precision and efficiency of XML information retrieval.
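
The abstract does not give the definition of SCVM, so the following Python sketch is only an illustration of the general idea of combining structure and content in one vector, not the authors' model. It assumes, purely for illustration, that each feature is a pair of element path and term, and that similarity is the cosine of the resulting count vectors; the function names `path_term_vector`, `term_vector`, and `cosine_similarity` are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_term_vector(xml_text):
    """Count (element path, lowercase term) pairs: structure plus content.

    Assumption for illustration only; this is not the SCVM definition.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        # Only text directly inside the element is counted; tail text
        # and attributes are ignored in this sketch.
        for term in (node.text or "").lower().split():
            counts[(path, term)] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def term_vector(xml_text):
    """Count lowercase terms only, discarding all structure."""
    root = ET.fromstring(xml_text)
    return Counter(" ".join(root.itertext()).lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two documents with the same words placed under different elements.
doc_a = "<paper><title>xml clustering</title><body>vector model</body></paper>"
doc_b = "<paper><title>vector model</title><body>xml clustering</body></paper>"

sim_content = cosine_similarity(term_vector(doc_a), term_vector(doc_b))
sim_structured = cosine_similarity(path_term_vector(doc_a), path_term_vector(doc_b))
print(sim_content, sim_structured)
```

In this toy pair the content-only vectors are identical, while the path-aware vectors share no feature, which is the kind of distinction a structure-aware measure is meant to capture when element placement is meaningful.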

Author information

Authors and Affiliations

School of Computer Science and Technology, Northwestern Polytechnical University, Xi’an, 710072, China
Lijun Zhang, Zhanhuai Li, Qun Chen & Ning Li

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, Australia
Heng Tao Shen
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
David R. Cheriton School of Computer Science, University of Waterloo, Canada
M. Tamer Özsu
Peking University, China
Lei Zou
Renmin University of China, China
Jiaheng Lu
National University of Singapore, Singapore
Tok-Wang Ling
Northeastern University, 110004, Shenyang, China
Ge Yu
College of Computer Science, Zhejiang University, 310027, Hangzhou, P.R. China
Yi Zhuang
University of Melbourne, Australia
Jie Shao

About this paper

Cite this paper

Zhang, L., Li, Z., Chen, Q., Li, N. (2010). Structure and Content Similarity for Clustering XML Documents. In: Shen, H.T., et al. Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16720-1_12

DOI: https://doi.org/10.1007/978-3-642-16720-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16719-5
Online ISBN: 978-3-642-16720-1
eBook Packages: Computer Science, Computer Science (R0)
