Comparing Dimension Reduction Techniques for Document Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3501)

Abstract

This research presents a systematic study of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods – Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP) – ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of the data set. Random projection consistently returns the worst results, which appears to be due to the noise distribution characterizing the document clustering task.
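
To make two of the four techniques concrete, the following is a minimal sketch of Document Frequency (DF) selection and Random Projection (RP) on a toy corpus. It is not the authors' implementation: the whitespace tokenizer, the helper names (`term_doc_counts`, `df_select`, `to_vectors`, `random_projection`), the Gaussian projection matrix scaled by 1/sqrt(k), and the four example documents are all assumptions chosen for illustration. ICA and LSI need a linear algebra library and are left out here.

```python
import math
import random
from collections import Counter

def term_doc_counts(docs):
    """Tokenize on whitespace; return (sorted vocabulary, per-document term counts)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, counts

def df_select(vocab, counts, k):
    """Document Frequency (DF): keep the k terms that occur in the most documents."""
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Sort by descending DF; break ties alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-df[t], t))[:k]

def to_vectors(terms, counts):
    """Build dense term-frequency vectors over the given term list."""
    return [[float(c[t]) for t in terms] for c in counts]

def random_projection(vectors, k, seed=0):
    """Random Projection (RP): multiply each vector by a d x k Gaussian matrix.

    Entries are drawn from N(0, 1) and scaled by 1/sqrt(k) (an assumed,
    commonly used construction, not necessarily the one used in the paper).
    """
    rng = random.Random(seed)
    d = len(vectors[0])
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    return [[sum(v[i] * R[i][j] for i in range(d)) for j in range(k)] for v in vectors]

# Hypothetical toy corpus, for illustration only.
docs = [
    "clustering groups similar documents",
    "documents are vectors of term counts",
    "random projection reduces vector dimension",
    "term counts feed clustering",
]
vocab, counts = term_doc_counts(docs)

# DF keeps a subset of the original terms, so the reduced axes stay interpretable.
top_terms = df_select(vocab, counts, 3)
df_vectors = to_vectors(top_terms, counts)

# RP mixes all terms into k new axes that have no direct term-level meaning.
rp_vectors = random_projection(to_vectors(vocab, counts), 3)
```

Either reduced representation (`df_vectors` or `rp_vectors`) could then be passed to a k-means implementation, which is the clustering algorithm the abstract refers to.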



Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tang, B., Shepherd, M., Heywood, M.I., Luo, X. (2005). Comparing Dimension Reduction Techniques for Document Clustering. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_30

  • DOI: https://doi.org/10.1007/11424918_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25864-3

  • Online ISBN: 978-3-540-31952-8

  • eBook Packages: Computer Science, Computer Science (R0)
