Abstract
Clustering XML documents runs into the high-dimensionality problem. Independent Component Analysis (ICA) can reduce dimensionality and, at the same time, uncover the latent variables underlying XML structures, which improves clustering quality. This paper proposes a novel strategy for clustering XML documents based on ICA. Each document is first represented in the Vector Space Model (VSM) using the D_path features extracted from its XML tree. ICA is then applied to reduce the dimensionality of the document vectors, and the vectors are clustered in the reduced Euclidean space spanned by the independent components. The experiments show that ICA improves clustering accuracy with stable performance.
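The vector-space step of this pipeline can be illustrated with a minimal sketch. The abstract does not define D_path, so the code below assumes, purely for illustration, that the features are root-to-node tag paths weighted by raw frequency; the helper names `tag_paths` and `build_vsm` are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Count root-to-node tag paths in one XML document.

    Assumption: these paths stand in for the paper's D_path features,
    whose exact definition is not given in the abstract.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def build_vsm(docs):
    """Map documents to path-frequency vectors over a shared vocabulary."""
    per_doc = [tag_paths(d) for d in docs]
    vocab = sorted(set().union(*per_doc))
    # One row per document, one column per distinct path.
    matrix = [[counts[path] for path in vocab] for counts in per_doc]
    return vocab, matrix


docs = [
    "<article><title/><author/><author/></article>",
    "<book><title/><publisher/></book>",
]
vocab, matrix = build_vsm(docs)
for row in matrix:
    print(row)
```

The number of columns grows with the number of distinct paths in the collection, which is where the dimensionality problem comes from. In a full pipeline, a matrix like this would be passed to an ICA implementation and the reduced vectors to a Euclidean clustering algorithm; those steps are omitted here.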
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, T., Liu, DX., Lin, XZ. (2006). XML Document Clustering by Independent Component Analysis. In: Nayak, R., Zaki, M.J. (eds) Knowledge Discovery from XML Documents. KDXD 2006. Lecture Notes in Computer Science, vol 3915. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11730262_4
DOI: https://doi.org/10.1007/11730262_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33180-3
Online ISBN: 978-3-540-33181-0
eBook Packages: Computer Science (R0)