Clustering and Retrieval of XML Documents by Structure

Hwang, Jeong Hee; Ryu, Keun Ho

doi:10.1007/11424826_100

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3481)

Abstract

We propose a method for clustering XML documents by their common structures and show how the technique applies to XML retrieval. Our approach first extracts frequent structures from XML documents by decomposing each document tree. We then apply a new XML document clustering algorithm based on these common structures, which does not rely on a pairwise similarity measure between XML documents. Our experimental results show the high speed and cluster cohesion of the algorithm.

This work was supported by Ubiquitous Bio-Information Technology Research Institute in Korea.
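
The two steps outlined in the abstract (extract frequent structures, then cluster without comparing documents pairwise) can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given here. It assumes that "structures" are root-to-node tag paths, that a path is frequent when it occurs in at least `min_support` documents, and that each document joins the cluster whose accumulated path profile it overlaps most. The names `tag_paths`, `frequent_paths`, `cluster_by_structure`, `min_support`, and `min_overlap` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Decompose one XML document tree into its set of root-to-node tag paths."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_paths(doc_paths, min_support):
    """Keep the paths that occur in at least min_support documents (assumed criterion)."""
    support = Counter(p for paths in doc_paths for p in paths)
    return {p for p, n in support.items() if n >= min_support}


def cluster_by_structure(docs, min_support=2, min_overlap=0.5):
    """Assign each document to the cluster whose structure profile it matches best.

    A document is compared with cluster profiles only, never with another document.
    """
    doc_paths = [tag_paths(d) for d in docs]
    frequent = frequent_paths(doc_paths, min_support)
    clusters = []  # each cluster: {"members": [doc index], "profile": Counter of paths}
    for i, paths in enumerate(doc_paths):
        items = paths & frequent
        best, best_score = None, 0.0
        for c in clusters:
            shared = sum(1 for p in items if p in c["profile"])
            score = shared / len(items) if items else 0.0
            if score > best_score:
                best, best_score = c, score
        # Open a new cluster when no existing profile overlaps enough.
        if best is None or best_score < min_overlap:
            best = {"members": [], "profile": Counter()}
            clusters.append(best)
        best["members"].append(i)
        best["profile"].update(items)
    return [c["members"] for c in clusters]


example_docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<cd><artist/><track/></cd>",
    "<cd><artist/><track/><track/></cd>",
]
# cluster_by_structure(example_docs) returns [[0, 1], [2, 3]]
```

Because each document is scored against a small number of cluster profiles rather than against every other document, the work grows with the number of clusters instead of the square of the collection size, which is the general motivation for avoiding pairwise similarity.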

Author information

Authors and Affiliations

Database Laboratory, Chungbuk National University, Korea
Jeong Hee Hwang & Keun Ho Ryu

Editor information

Editors and Affiliations

Department of Mathematics and Computer Science, University of Perugia, via Vanvitelli, 1, I-06123, Perugia, Italy
Osvaldo Gervasi
Department of Computer Science, University of Calgary, 2500 University Drive N.W., T2N 1N4, Calgary, AB, Canada
Marina L. Gavrilova
William Norris Professor, Head of the Computer Science and Engineering Department, University of Minnesota, USA
Vipin Kumar
Department of Chemistry, University of Perugia, Via Elce di Sotto, 8, P.O. Box, I-06123, Perugia, Italy
Antonio Laganà
Institute of High Performance Computing, IHCP, 1 Science Park Road, 01-01 The Capricorn, Singapore Science Park II, 117528, Singapore
Heow Pueh Lee
School of Computing, Soongsil University, Seoul, Korea
Youngsong Mun
Clayton School of IT, Monash University, 3800, Clayton, Australia
David Taniar
OptimaNumerics Ltd, P.O. Box, Belfast, United Kingdom
Chih Jeng Kenneth Tan

About this paper

Cite this paper

Hwang, J.H., Ryu, K.H. (2005). Clustering and Retrieval of XML Documents by Structure. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424826_100

DOI: https://doi.org/10.1007/11424826_100
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25861-2
Online ISBN: 978-3-540-32044-9
eBook Packages: Computer Science, Computer Science (R0)
