Abstract
This paper addresses the problem of semantically clustering the growing number of schemaless XML documents. In our approach, each document in a collection is first represented by a macro-path sequence. Second, a similarity matrix for the collection is constructed by computing the pairwise similarity between these macro-path sequences. Finally, the desired clusters are obtained by applying a hierarchical clustering technique. Experimental results are also reported.
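The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not define a macro-path sequence or the similarity measure, so everything below is an assumption made for illustration: `path_sequence` uses plain root-to-leaf tag paths as a stand-in for macro-paths, `similarity` uses the matching-subsequence ratio from Python's `difflib`, and `hierarchical_clusters` uses average linkage with a hypothetical stopping `threshold`. It is not the authors' method.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def path_sequence(xml_text):
    """Return the root-to-leaf tag paths of a document, in document order.

    Assumption: this is only a stand-in for the paper's macro-path
    sequence, whose definition is not given in the abstract.
    """
    paths = []

    def walk(node, prefix):
        current = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.append(current)  # leaf element: record the full path
        for child in children:
            walk(child, current)

    walk(ET.fromstring(xml_text), "")
    return paths


def similarity(seq_a, seq_b):
    """Assumed measure: difflib's matching ratio, a value in [0, 1]."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()


def similarity_matrix(sequences):
    """Pairwise similarity between all path sequences."""
    n = len(sequences)
    return [[similarity(sequences[i], sequences[j]) for j in range(n)]
            for i in range(n)]


def hierarchical_clusters(matrix, threshold):
    """Average-linkage agglomerative clustering (an assumed linkage).

    Repeatedly merges the two most similar clusters and stops once the
    best average similarity falls below `threshold`.
    """
    clusters = [[i] for i in range(len(matrix))]

    def linkage(a, b):
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (x, y) = max(
            (linkage(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Calling `hierarchical_clusters(similarity_matrix([path_sequence(d) for d in docs]), threshold)` on a list of XML strings `docs` returns lists of document indices, one list per cluster. The threshold controls how aggressively documents are merged and would need tuning for any real collection.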
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shen, Y., Wang, B. (2003). Clustering Schemaless XML Documents. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds) On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE. OTM 2003. Lecture Notes in Computer Science, vol 2888. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39964-3_49
DOI: https://doi.org/10.1007/978-3-540-39964-3_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20498-5
Online ISBN: 978-3-540-39964-3
eBook Packages: Springer Book Archive