A Competitive Term Selection Method for Information Retrieval

López, Franco Rojas; Jiménez-Salazar, Héctor; Pinto, David

doi:10.1007/978-3-540-70939-8_41

Franco Rojas López¹,
Héctor Jiménez-Salazar¹ &
David Pinto¹,²

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Term selection is a necessary component of most natural language processing tasks. Although different unsupervised techniques have been proposed, the best results are obtained at a high computational cost, for instance by methods based on entropy. The aim of this paper is to propose an unsupervised term selection technique based on a bigram-enriched version of the transition point. Our approach first reduces the corpus vocabulary using the transition point technique and then expands the reduced corpus with bigrams obtained from the same corpus, i.e., without external knowledge sources. This approach provides a considerable dimensionality reduction of the TREC-5 collection and has also been shown to improve precision for some entropy-based methods.
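The two steps named in the abstract (reduction by transition point, then enrichment with bigrams from the same corpus) can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: the transition point is computed with the commonly cited closed form TP = (sqrt(8·I1 + 1) − 1) / 2, where I1 is the number of terms occurring exactly once; terms are kept when their frequency lies within a relative neighbourhood of TP (the `width` parameter is hypothetical); and a bigram is kept when at least one of its words is a selected term (also a hypothetical rule, not necessarily the authors' criterion).

```python
import math
from collections import Counter


def transition_point(freqs):
    """Closed-form transition point (assumed form, not taken from the abstract).

    i1 is the number of distinct terms that occur exactly once.
    """
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_terms(tokens, width=0.4):
    """Keep terms whose frequency is within +/- width * TP of the transition point.

    The neighbourhood width is a hypothetical parameter chosen for illustration.
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


def bigram_enrichment(tokens, selected):
    """Collect adjacent word pairs from the same text that contain a selected term.

    The 'at least one selected word' rule is an assumption for this sketch.
    """
    return {
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a in selected or b in selected
    }


# Toy text: x occurs 4 times, a twice, and b, c, d once each.
tokens = "x a x b x a c x d".split()
selected = select_terms(tokens)
bigrams = bigram_enrichment(tokens, selected)
print(selected, sorted(bigrams))
```

In this toy text three terms occur once, so the assumed formula gives a transition point of 2 and only the mid-frequency term `a` survives the reduction; the enrichment step then adds back the word pairs around it. No external resource is consulted at any point, which matches the abstract's description of the approach.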

This work has been partially supported by the BUAP-701 PROMEP/103.5/05/1536 grant and FCC-VIEP-BUAP.

Author information

Authors and Affiliations

Faculty of Computer Science, BUAP, Puebla, 72570 Ciudad Universitaria, Mexico
Franco Rojas López, Héctor Jiménez-Salazar & David Pinto
Department of Information Systems and Computation, UPV, Valencia 46022, Camino de Vera s/n, Spain
David Pinto

Editor information

Alexander Gelbukh

About this paper

Cite this paper

López, F.R., Jiménez-Salazar, H., Pinto, D. (2007). A Competitive Term Selection Method for Information Retrieval. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_41

DOI: https://doi.org/10.1007/978-3-540-70939-8_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)