XML Document Clustering by Independent Component Analysis

  • Conference paper
Knowledge Discovery from XML Documents (KDXD 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3915)

Abstract

Clustering XML documents runs into the high-dimensionality problem. Independent Component Analysis (ICA) can reduce dimensionality and, at the same time, uncover the latent variables underlying XML structures, which improves clustering quality. This paper proposes a novel strategy for clustering XML documents based on ICA. Each document is first represented in the Vector Space Model (VSM) using the D_path features extracted from its XML tree. ICA is then applied to reduce the dimensionality of the document vectors. Finally, the document vectors are clustered in the reduced Euclidean space spanned by the independent components. Experiments show that ICA improves clustering accuracy with stable performance.
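The first step of this pipeline, turning XML trees into vectors, can be illustrated with a minimal sketch. The abstract does not define D_path, so the code below assumes, purely for illustration, that each feature is a root-to-node tag path and that each vector component is the number of times that path occurs in the document. The function names `tag_paths` and `to_vsm` and the two sample documents are hypothetical and do not come from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the root-to-node tag path of every element in the document.

    Assumption: this stands in for the paper's D_path features, which the
    abstract does not define.
    """
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def to_vsm(docs):
    """Map XML strings to path-frequency vectors over a shared, sorted vocabulary."""
    counts = [Counter(tag_paths(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for paths that a document does not contain.
    vectors = [[c[p] for p in vocab] for c in counts]
    return vocab, vectors


# Hypothetical sample documents with two different structures.
docs = [
    "<article><title/><author/><author/></article>",
    "<book><title/><publisher/></book>",
]
vocab, vectors = to_vsm(docs)
for path, column in zip(vocab, zip(*vectors)):
    print(path, column)
```

Even in this toy case the vocabulary grows with every distinct path, which is the high-dimensionality problem the abstract refers to. In a fuller pipeline the resulting matrix could be passed to an ICA implementation (for example `FastICA` in scikit-learn) and the reduced vectors to a clustering algorithm such as k-means; those steps are omitted here because the abstract does not give their settings.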


Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, T., Liu, DX., Lin, XZ. (2006). XML Document Clustering by Independent Component Analysis. In: Nayak, R., Zaki, M.J. (eds) Knowledge Discovery from XML Documents. KDXD 2006. Lecture Notes in Computer Science, vol 3915. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11730262_4

  • DOI: https://doi.org/10.1007/11730262_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33180-3

  • Online ISBN: 978-3-540-33181-0

  • eBook Packages: Computer Science, Computer Science (R0)
