XML Document Clustering by Independent Component Analysis

Wang, Tong; Liu, Da-Xin; Lin, Xuan-Zuo

doi:10.1007/11730262_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3915)

Included in the following conference series:

International Workshop on Knowledge Discovery from XML Documents

Abstract

Clustering XML documents runs into the high-dimensionality problem. Independent Component Analysis (ICA) can reduce dimensionality and, at the same time, uncover the latent variables underlying XML structures, which improves clustering quality. This paper proposes a novel strategy for clustering XML documents based on ICA. Using D_path extracted from XML trees, each document is first represented in the Vector Space Model (VSM). ICA is then applied to reduce the dimensionality of the document vectors. Finally, the document vectors are clustered in the reduced Euclidean space spanned by the independent components. The experiments show that ICA can improve clustering accuracy with stable performance.
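The first step of the pipeline, turning XML trees into document vectors, can be illustrated with a minimal sketch. The abstract does not define D_path, so the code below assumes a simple stand-in: each distinct root-to-element tag path is one feature, and a document's vector holds the number of times each path occurs. The function names `tag_paths` and `build_vsm` and the sample documents are hypothetical, not taken from the paper. The ICA and clustering steps are not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Count root-to-element tag paths (e.g. 'book/author') in one document.

    Assumption: this path definition is an illustrative stand-in for the
    paper's D_path, which the abstract does not specify.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        # Extend the parent's path with this element's tag.
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def build_vsm(docs):
    """Return (vocabulary, matrix): one count vector per document."""
    per_doc = [tag_paths(d) for d in docs]
    vocab = sorted(set().union(*per_doc))
    # Counter returns 0 for paths a document does not contain.
    matrix = [[counts[p] for p in vocab] for counts in per_doc]
    return vocab, matrix


# Hypothetical sample documents, for illustration only.
docs = [
    "<book><title/><author/><author/></book>",
    "<book><title/><author/></book>",
    "<article><title/><journal/></article>",
]
vocab, matrix = build_vsm(docs)
for row in matrix:
    print(row)
```

Even on this toy input the vocabulary grows with every new path, which is the dimensionality problem the abstract refers to. In a fuller pipeline, the rows of `matrix` would be passed to an ICA implementation and the reduced vectors to a clustering algorithm.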

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Engineering University, China
Tong Wang & Da-Xin Liu
Northeast Agriculture University, Harbin, China
Xuan-Zuo Lin

Editor information

Editors and Affiliations

Information Technology, Queensland University of Technology, Brisbane, Australia
Richi Nayak
Computer Science Department, Rensselaer Polytechnic Institute, USA
Mohammed J. Zaki

About this paper

Cite this paper

Wang, T., Liu, DX., Lin, XZ. (2006). XML Document Clustering by Independent Component Analysis. In: Nayak, R., Zaki, M.J. (eds) Knowledge Discovery from XML Documents. KDXD 2006. Lecture Notes in Computer Science, vol 3915. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11730262_4

DOI: https://doi.org/10.1007/11730262_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33180-3
Online ISBN: 978-3-540-33181-0
eBook Packages: Computer Science (R0)
