Abstract
XML is becoming increasingly important in data exchange and information management. A starting point for retrieving information and integrating documents efficiently is to cluster documents that have similar structure. In this paper, we therefore propose a new XML document clustering method based on structural similarity. Our approach first extracts the representative structures of XML documents by sequential pattern mining. We then cluster XML documents of similar structure using a clustering algorithm for transactional data, treating each XML document as a transaction and the frequent structures of the documents as the items of that transaction. We also apply our technique to XML retrieval. Our experiments show the efficiency and good performance of the proposed clustering method.
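The two-stage idea in the abstract (mine frequent structures, then cluster documents as transactions over those structures) can be illustrated with a minimal Python sketch. This is not the authors' implementation. As simplifying assumptions, root-to-leaf tag paths counted by document frequency stand in for sequential pattern mining, and a greedy single-pass grouping by Jaccard similarity stands in for a transactional clustering algorithm. All function names, the sample documents, and the `min_support` and `threshold` parameters are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.add(path)  # leaf reached: record the full path
        for child in children:
            walk(child, path)

    walk(root, ())
    return paths


def frequent_paths(all_paths, min_support):
    """Keep paths that occur in at least min_support documents.

    Assumption: this document-frequency filter is a simplified
    stand-in for sequential pattern mining.
    """
    counts = Counter(p for paths in all_paths for p in paths)
    return {p for p, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(transactions, threshold):
    """Greedy single-pass clustering of item sets.

    Each transaction joins the first cluster whose accumulated item
    set is at least `threshold` similar; otherwise it starts a new
    cluster. Assumption: a stand-in for a transactional clustering
    algorithm, not the one used in the paper.
    """
    clusters = []  # list of (item_set, member_indices)
    for i, items in enumerate(transactions):
        for cluster_items, members in clusters:
            if jaccard(items, cluster_items) >= threshold:
                cluster_items |= items
                members.append(i)
                break
        else:
            clusters.append((set(items), [i]))
    return [members for _, members in clusters]


def cluster_documents(docs, min_support, threshold):
    """Treat each document as a transaction of its frequent paths."""
    all_paths = [tag_paths(d) for d in docs]
    frequent = frequent_paths(all_paths, min_support)
    transactions = [paths & frequent for paths in all_paths]
    return cluster(transactions, threshold)


# Hypothetical sample collection: two book-like and two cd-like documents.
DOCS = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<cd><artist/><track/></cd>",
    "<cd><artist/><track/><track/></cd>",
]

print(cluster_documents(DOCS, min_support=2, threshold=0.5))
# prints [[0, 1], [2, 3]]
```

In this sketch the path `book/year` appears in only one document, so it is filtered out before clustering, and the two book-like documents end up with identical transactions. A real sequential pattern miner would also find partial and non-contiguous structures, which this simplification ignores.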
This work was supported by University IT Research Center Project and ETRI in Korea.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hwang, J.H., Ryu, K.H. (2004). A New XML Clustering for Structural Retrieval. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, TW. (eds) Conceptual Modeling – ER 2004. ER 2004. Lecture Notes in Computer Science, vol 3288. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30464-7_30
DOI: https://doi.org/10.1007/978-3-540-30464-7_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23723-5
Online ISBN: 978-3-540-30464-7
eBook Packages: Springer Book Archive