Abstract
Distributional Clustering has been shown to be an effective and powerful approach to supervised term extraction, aimed at reducing the dimensionality of the original indexing space for Automatic Text Categorization [2]. In a recent paper [1] we introduced a new Signal Processing approach to Distributional Clustering, which reached categorization results on the 20 Newsgroups dataset similar to those obtained by other information-theoretic approaches [3][4][5]. Here we re-validate our method by showing that the 90-category Reuters-21578 benchmark collection can be indexed with only 50 clusters at a minimal loss of categorization accuracy (around 2% with a Naïve Bayes categorizer).
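As a rough illustration of what indexing a collection with term clusters means, the sketch below collapses a document's term counts into per-cluster counts, so that a document is represented by one feature per cluster rather than one per term. It assumes a term-to-cluster assignment is already available; how that assignment is computed (the Signal Processing approach of [1]) is not described in this abstract and is not reproduced here. The function name `cluster_index` and the toy vocabulary are hypothetical.

```python
from collections import Counter


def cluster_index(doc_tokens, term_to_cluster, n_clusters):
    """Map a tokenized document to a vector of per-cluster term counts."""
    counts = Counter(doc_tokens)
    vector = [0] * n_clusters
    for term, freq in counts.items():
        cluster = term_to_cluster.get(term)
        if cluster is not None:  # terms without a cluster are dropped
            vector[cluster] += freq
    return vector


# Hypothetical toy assignment of five terms to two clusters
# (illustrative only, not taken from the paper).
term_to_cluster = {"oil": 0, "crude": 0, "barrel": 0, "wheat": 1, "grain": 1}

doc = "crude oil prices rose as wheat and grain fell oil".split()
vec = cluster_index(doc, term_to_cluster, n_clusters=2)
print(vec)
```

With 50 clusters, each document would become a 50-dimensional count vector of this kind, which can then be passed to a categorizer such as Naïve Bayes.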
References
[1] Capdevila, M., Márquez, O.W.: A Signal Processing Approach to Distributional Clustering of Terms in Automatic Text Categorization. In: Proceedings of INSCIT2006, I Int. Conf. on Multidisciplinary Information Sci. and Tech. (2006)
[2] Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)
[3] Baker, L.D., McCallum, A.K.: Distributional Clustering of Words for Text Classification. In: Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, ACM Press, New York (1998)
[4] Slonim, N., Tishby, N.: The Power of Word Clusters for Text Classification. In: 23rd European Colloquium on Information Retrieval Research (2001)
[5] Dhillon, I.S., Mallela, S., Kumar, R.: A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification. Journal of Machine Learning Research 3, 1265–1287 (2003)
[6] Carlson, A.B.: Communication Systems: An Introduction to Signals and Noise in Electrical Communications, 3rd edn. McGraw-Hill, New York (1986)
[7] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Capdevila Dalmau, M., Márquez Flórez, O.W. (2007). Experimental Results of the Signal Processing Approach to Distributional Clustering of Terms on Reuters-21578 Collection. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_67
DOI: https://doi.org/10.1007/978-3-540-71496-5_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science, Computer Science (R0)