Abstract
This paper presents a systematic study of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods – Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP) – ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of the data set. Random projection consistently returns the worst results, which appears to be due to the noise distribution characterizing the document clustering task.
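As a rough illustration of the kind of pipeline the abstract describes (reduce the term space, then cluster with k-means), the following minimal sketch applies three of the four techniques to a made-up document-term matrix. It is not the authors' implementation: the function names, the toy matrix, the target dimensions and the simple Lloyd-style k-means are all assumptions chosen for brevity. ICA is left out because it needs an iterative estimator such as FastICA.

```python
import numpy as np

def df_select(X, k):
    """Document Frequency: keep the k terms that occur in the most documents."""
    df = (X > 0).sum(axis=0)                        # documents containing each term
    keep = np.sort(np.argsort(-df, kind="stable")[:k])
    return X[:, keep]

def lsi(X, k):
    """LSI: document coordinates in the top-k singular directions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]                         # same as X @ Vt[:k].T

def random_projection(X, k, rng):
    """Random Projection: multiply by a Gaussian matrix scaled by 1/sqrt(k)."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R

def kmeans(Z, n_clusters, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per row of Z."""
    start = rng.choice(len(Z), n_clusters, replace=False)
    centers = Z[start].astype(float)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):                 # keep old center if cluster is empty
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

# Hypothetical toy corpus: 6 documents (rows) by 8 terms (columns), two rough topics.
X = np.array([
    [2, 1, 3, 0, 0, 0, 1, 0],
    [1, 2, 2, 0, 0, 0, 0, 1],
    [3, 1, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 2, 3, 1, 1, 0],
    [0, 0, 0, 1, 2, 2, 0, 1],
    [0, 0, 0, 3, 1, 2, 1, 0],
], dtype=float)

rng = np.random.default_rng(0)
reduced = {
    "DF": df_select(X, 4),
    "LSI": lsi(X, 2),
    "RP": random_projection(X, 2, rng),
}
labels = {name: kmeans(Z, 2, rng) for name, Z in reduced.items()}
for name in reduced:
    print(name, reduced[name].shape, labels[name])
```

A comparison in the spirit of the paper would score each labeling against known classes on real benchmark collections. This toy matrix is far too small to reproduce the ranking reported in the abstract.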
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tang, B., Shepherd, M., Heywood, M.I., Luo, X. (2005). Comparing Dimension Reduction Techniques for Document Clustering. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_30
DOI: https://doi.org/10.1007/11424918_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25864-3
Online ISBN: 978-3-540-31952-8
eBook Packages: Computer Science, Computer Science (R0)