
A Competitive Term Selection Method for Information Retrieval

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Abstract

Term selection is a necessary component of most natural language processing tasks. Although different unsupervised techniques have been proposed, the best results, for instance those of entropy-based methods, come at a high computational cost. This paper proposes an unsupervised term selection technique based on a bigram-enriched version of the transition point. Our approach first reduces the corpus vocabulary with the transition point technique and then expands the reduced corpus with bigrams obtained from the same corpus, i.e., without external knowledge sources. This approach considerably reduces the dimensionality of the TREC-5 collection and has also been shown to improve precision for some entropy-based methods.
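The two steps named in the abstract (reduction by transition point, then enrichment with bigrams from the same corpus) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything specific here is an assumption: the transition point is computed with the expression commonly used in the transition point literature, TP = (sqrt(8·I1 + 1) − 1) / 2, where I1 is the number of terms occurring exactly once; the neighbourhood `width` of 0.4 and the rule for keeping bigrams are hypothetical choices, and the function names are illustrative only.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed formula: TP = (sqrt(8 * I1 + 1) - 1) / 2,
    where I1 is the number of terms that occur exactly once."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_terms(tokens, width=0.4):
    """Keep terms whose frequency lies in a neighbourhood of TP.
    The width parameter is a hypothetical choice, not taken from the paper."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


def enrich_with_bigrams(tokens, selected):
    """Hypothetical enrichment rule: keep every adjacent word pair from the
    same corpus in which at least one word is a selected term."""
    return {
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a in selected or b in selected
    }


if __name__ == "__main__":
    tokens = "x y x y z w v".split()
    selected = select_terms(tokens)
    print(sorted(selected))
    print(sorted(enrich_with_bigrams(tokens, selected)))
```

In this toy input three terms occur once, so the assumed formula gives a transition point of 2 and the mid-frequency terms `x` and `y` are retained; the bigram step then draws only on the same token sequence, in keeping with the abstract's "without external knowledge sources".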

This work has been partially supported by the BUAP-701 PROMEP/103.5/05/1536 grant and FCC-VIEP-BUAP.

Editor information

Alexander Gelbukh

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

López, F.R., Jiménez-Salazar, H., Pinto, D. (2007). A Competitive Term Selection Method for Information Retrieval. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_41

  • DOI: https://doi.org/10.1007/978-3-540-70939-8_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70938-1

  • Online ISBN: 978-3-540-70939-8

  • eBook Packages: Computer Science, Computer Science (R0)
