Abstract
Term selection process is a very necessary component for most natural language processing tasks. Although different unsupervised techniques have been proposed, the best results are obtained with a high computational cost, for instance, those based on the use of entropy. The aim of this paper is to propose an unsupervised term selection technique based on the use of a bigram-enriched version of the transition point. Our approach reduces the corpus vocabulary size by using the transition point technique and, thereafter, it expands the reduced corpus with bigrams obtained from the same corpus, i.e., without external knowledge sources. This approach provides a considerable dimensionality reduction of the TREC-5 collection and, also has shown to improve precision for some entropy-based methods.
This work has been partially supported by the BUAP-701 PROMEP/103.5/05/1536 grant and FCC-VIEP-BUAP.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Baeza-Yates, R., Ribeiro, N.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
Booth, A.: A law of occurrence of words of low frequency. Information and Control 10(4), 383–396 (1967)
Shannon, C.E.: The Bell System Technical Journal 27, 379 (1948)
Gelbukh, A., Sidorov, G., Guzman-Arenas, A.: Use of a weighted topic hierarchy for text retrieval and classification. In: Matoušek, V., Mautner, P., Ocelíková, J., Sojka, P. (eds.) TSD 1999. LNCS (LNAI), vol. 1692, pp. 130–135. Springer, Heidelberg (1999)
Jiménez-Salazar, H., Castro, M., Rojas, F., Miñón, E., Pinto, D., Carcedo, F.: Unsupervised Term Selection using Entropy. In: Research on Computing Science 14, México, pp. 163–172 (2005)
Montemurro, M.A., Zanette, D.H.: Entropic Analysis of the role of the words in literaty texts, CoRR, arXiv:cond-mat/0109218, v1 12 (Sept. 2001)
Moyotl, E.: DPT: un método de selección de términos para categorización de textos, Master in Computer Science Thesis, FCC-BUAP (In spanish) (2005)
Moyotl, E., Jiménez, H.: An Analysis on Frequency of Terms for Text Categorization. In: Procesamiento del Lenguaje Natural, España, pp. 141–146.
Moyotl, E., Jiménez, H.: Enhancement of DPT Feature Selection Method for Text Categorization. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 706–709. Springer, Heidelberg (2005)
Pérez-Carballo, J., Strzalkowski, T.: Natural Language Information Retrieval: progress report. Information Processing and Management 36(1), 155–178 (2000)
Pinto, D., Jiménez-Salazar, H., Rosso, P., Sanchis, E.: BUAP-UPV TPIRS: A System for Document Indexing Reduction at WebCLEF. In: Peters, C., Gey, F.C., Gonzalo, J., Müller, H., Jones, G.J.F., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D. (eds.) CLEF 2005. LNCS, vol. 4022, Springer, Heidelberg (2006)
Pinto, D., Jiménez-Salazar, H.: Paolo Rosso: Clustering Abstracts of Scientific Texts using the Transition Point Technique. In: Gelbukh, A. (ed.) CICLing 2006. LNCS, vol. 3878, pp. 536–546. Springer, Heidelberg (2006)
Rojas, F., Jiménez, H., Pinto, D., López, A.: Dimensionality reduction for Information Retrieval. Research on Computing Science 20, 107–112 (2006)
Rojas, F., Jiménez, H., Pinto, D.: Text Reduction-Enrichment at WebCLEF. In: Proceedings of CLEF 2006, p. 53 (2006)
Salton, G., Wong, A., Yang, C.: A Vector Space Model for Automatic Indexing. Communications of the ACM 18(11), 613–620 (1975)
Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)
Urbizagástegui, A.R.: Las Posibilidades de la Ley de Zipf en la Indización Automática (In spanish) (1999), http://www.geocities.com/ResearchTriangle/2851/RUBEN2.htm
Yang, Y., Pedersen, P.: A Comparative Study on Feature Selection in Text Categorization. In: Proc. of ICML-97, 14th Int. Conf. on Machine Learning, pp. 412–420 (1997)
Zipf, G.K.: Human Behaviour and the Principle of Least Effort. Addison-Wesley, Reading (1949)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
López, F.R., Jiménez-Salazar, H., Pinto, D. (2007). A Competitive Term Selection Method for Information Retrieval. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_41
Download citation
DOI: https://doi.org/10.1007/978-3-540-70939-8_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer ScienceComputer Science (R0)