Clustering Schemaless XML Documents

Shen, Yun; Wang, Bing

doi:10.1007/978-3-540-39964-3_49

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2888)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

This paper addresses the problem of semantically clustering the growing number of schemaless XML documents. In our approach, each document in a collection is first represented by a macro-path sequence. Next, a similarity matrix for the collection is constructed by computing the similarity between each pair of macro-path sequences. Finally, the desired clusters are built using hierarchical clustering. Experimental results are also reported.
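
The three steps named in the abstract can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the authors' method: the abstract does not define a macro-path sequence or the similarity measure, so the sketch substitutes the set of root-to-leaf tag paths for the document representation and Jaccard similarity for the pairwise score. It is also purely structural and ignores the meaning of tag names. The function names `tag_paths`, `jaccard`, `similarity_matrix`, and `agglomerate`, the average-linkage rule, and the `threshold` parameter are all hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one document.

    Assumption: a simplified stand-in for the paper's macro-path
    sequence, whose definition the abstract does not give.
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.add(path)  # leaf element: record the full path
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard(a, b):
    """Assumed similarity measure: shared paths divided by all paths."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(docs):
    """Pairwise similarity between the representations of all documents."""
    reps = [tag_paths(d) for d in docs]
    return [[jaccard(x, y) for y in reps] for x in reps]


def agglomerate(sim, threshold):
    """Average-linkage agglomerative clustering over a similarity matrix.

    Repeatedly merges the two most similar clusters and stops when no
    pair of clusters reaches the threshold.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Three toy documents without any schema: two book-like, one order-like.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<order><item><price/></item><customer/></order>",
]
sim = similarity_matrix(docs)
clusters = agglomerate(sim, threshold=0.5)
print(clusters)
```

With these toy inputs the two book-like documents share two of three distinct paths, so they merge, while the order-like document shares none and stays in its own cluster. A semantic variant would additionally compare tag names by meaning rather than by exact string match.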

Author information

Authors and Affiliations

Department of Computer Science, University of Hull, Hull, HU6 7RX, UK
Yun Shen & Bing Wang

Editor information

Editors and Affiliations

STARLab, Vrije Universiteit Brussel (VUB), Bldg G/10, Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, RMIT University, Bld 10.10, 376-392 Swanston Street, VIC 3001, Melbourne, Australia
Zahir Tari
Department of Electrical Engineering and Computer Science, Vanderbilt University, TN 37203, Nashville, USA
Douglas C. Schmidt

About this paper

Cite this paper

Shen, Y., Wang, B. (2003). Clustering Schemaless XML Documents. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds) On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE. OTM 2003. Lecture Notes in Computer Science, vol 2888. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39964-3_49

DOI: https://doi.org/10.1007/978-3-540-39964-3_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20498-5
Online ISBN: 978-3-540-39964-3
eBook Packages: Springer Book Archive
