Abstract
This paper studies the structure of vectors obtained by applying term selection methods to high-dimensional text collections. We found that the distance to transition point (DTP) method omits commonly occurring terms, which are poor discriminators between documents but convey important information about a collection. Experimental results obtained on the Reuters-21578 collection with the k-NN classifier show that feature selection by DTP combined with common terms slightly outperforms simple document frequency.
This work was supported by VIEP-BUAP, grant III9-04/ING/G.
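The selection scheme summarized in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract itself does not state: the transition point is taken as TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms occurring exactly once (the form commonly derived from Booth's law); term frequencies are counted over the whole collection; and "combined with common terms" is read as a plain union of the k terms closest to TP with the n terms of highest document frequency. The function names and parameters are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def transition_point(term_freq):
    """Transition point, assuming TP = (sqrt(8*I1 + 1) - 1) / 2.

    I1 is the number of terms whose collection frequency is exactly 1.
    This formula is an assumption; the abstract does not state it.
    """
    i1 = sum(1 for f in term_freq.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_terms(docs, k, n_common):
    """Return (TP, selected terms) for a list of tokenized documents.

    Hypothetical combination rule: the k terms with the smallest
    distance |TP - frequency|, followed by the n_common terms with the
    highest document frequency that are not already selected.
    """
    # Collection-wide term frequency and per-term document frequency.
    term_freq = Counter(t for doc in docs for t in doc)
    doc_freq = Counter(t for doc in docs for t in set(doc))

    tp = transition_point(term_freq)

    # Rank by distance to the transition point (ties broken alphabetically).
    by_dtp = sorted(term_freq, key=lambda t: (abs(tp - term_freq[t]), t))
    # Rank by document frequency, most common first.
    by_df = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))

    selected = by_dtp[:k]
    seen = set(selected)
    for t in by_df[:n_common]:
        if t not in seen:
            selected.append(t)
            seen.add(t)
    return tp, selected


if __name__ == "__main__":
    toy_docs = [
        ["oil", "price", "rise"],
        ["oil", "price", "fall"],
        ["oil", "export", "price", "oil"],
    ]
    tp, terms = select_terms(toy_docs, k=2, n_common=1)
    print(tp, terms)
```

In this toy collection the most frequent term, "oil", lies farthest from the transition point and is dropped by the DTP ranking alone; the document-frequency step restores it, which is the kind of omission the abstract describes.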
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Moyotl-Hernández, E., Jiménez-Salazar, H. (2005). Enhancement of DTP Feature Selection Method for Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_80
DOI: https://doi.org/10.1007/978-3-540-30586-6_80
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)