Abstract
Document clustering, or unsupervised document classification, has been used to enhance information retrieval, and its practical importance has recently made it an intense area of research. Outliers are the elements whose similarity to the centroid of their category falls below a threshold value. In this paper, we show that excluding outliers from noisy training data significantly improves the performance of the centroid-based classifier, which is the best known method. The proposed method performs about 10% better than the centroid-based classifier.
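The idea of filtering outliers before building class centroids can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the threshold, or the document representation, so cosine similarity, the value 0.7, the dense term-weight lists, and the names `train_centroids` and `classify` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train_centroids(docs, labels, threshold=None):
    """Build one centroid per class.

    If a threshold is given (an assumed parameter), documents whose
    similarity to their class centroid is below it are treated as
    outliers, and the centroid is recomputed from the remaining ones.
    """
    by_class = defaultdict(list)
    for vec, label in zip(docs, labels):
        by_class[label].append(vec)

    centroids = {}
    for label, vecs in by_class.items():
        c = centroid(vecs)
        if threshold is not None:
            kept = [v for v in vecs if cosine(v, c) >= threshold]
            if kept:  # keep the original centroid if everything was filtered
                c = centroid(kept)
        centroids[label] = c
    return centroids


def classify(vec, centroids):
    """Assign the class whose centroid is most similar to the document."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training set (hypothetical): the third "a" document is noise that
# looks like class "b" and pulls the "a" centroid toward it.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "a", "b", "b"]

plain = train_centroids(docs, labels)
filtered = train_centroids(docs, labels, threshold=0.7)

query = [0.45, 0.55]
print(classify(query, plain), classify(query, filtered))
```

In this toy data the mislabeled document drags the class "a" centroid toward class "b", so a query that leans toward "b" can end up closer to the "a" centroid. Recomputing the centroid after dropping the low-similarity document removes that pull. A single filtering pass is used here for brevity; whether the original method filters once or iterates is not stated in the abstract.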
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shin, K., Abraham, A., Han, S. (2006). Enhanced Centroid-Based Classification Technique by Filtering Outliers. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2006. Lecture Notes in Computer Science, vol 4188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846406_20
DOI: https://doi.org/10.1007/11846406_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39090-9
Online ISBN: 978-3-540-39091-6
eBook Packages: Computer Science (R0)