Simple TF·IDF Is Not the Best You Can Get for Regionalism Classification

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8403)

Abstract

In widely spoken languages such as English or Spanish, some words are characteristic of a particular region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words particular to a region involves discriminating them from the set of words common to all regions. This poses a trade-off: a term’s frequency should be salient enough for the term to be considered important, while being common across regions dampens this salience. This is the well-known problem of Term Frequency versus Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we first propose several alternative formulae empirically and conclude that a broader search space must be explored; we therefore propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms, given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variants across the Americas and Spain.
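
The trade-off between term frequency and commonness across regions can be illustrated with a minimal Python sketch. It is not the formula from the paper: it assumes each region's text is treated as a single "document", uses the textbook TF·IDF definition, and adds two hypothetical weighting exponents, alpha and beta, only to show where weighting factors could enter. The function name and the toy token lists are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_by_region(region_tokens, alpha=1.0, beta=1.0):
    """Score each term per region as tf**alpha * idf**beta.

    region_tokens maps a region name to its list of tokens; each region
    is treated as one "document" (an assumption of this sketch).
    alpha and beta are hypothetical weighting exponents and should be
    positive; alpha = beta = 1 gives plain TF·IDF.
    """
    counts = {region: Counter(tokens) for region, tokens in region_tokens.items()}
    n_regions = len(counts)

    # Document frequency: the number of regions in which each term occurs.
    df = Counter()
    for region_counts in counts.values():
        df.update(region_counts.keys())

    scores = {}
    for region, region_counts in counts.items():
        total = sum(region_counts.values())
        scores[region] = {
            term: (n / total) ** alpha * math.log(n_regions / df[term]) ** beta
            for term, n in region_counts.items()
        }
    return scores


# Toy data, invented for illustration only.
toy = {
    "UK": ["the", "cooker", "is", "on", "the", "cooker"],
    "US": ["the", "stove", "is", "on"],
}
scores = tf_idf_by_region(toy)
for region, term_scores in scores.items():
    best = max(term_scores, key=term_scores.get)
    print(region, best, round(term_scores[best], 3))
```

With this toy input, words shared by both regions (such as "the") receive a score of zero because their IDF is log(2/2) = 0, while "cooker" and "stove" stand out in their respective regions. Searching over richer combinations of TF and IDF terms, as the abstract proposes with Genetic Programming, would replace the fixed expression inside the dictionary comprehension.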

Work done with support from CONACyT-SNI, Mexico, and SIP project IPN 20121202.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Calvo, H. (2014). Simple TF·IDF Is Not the Best You Can Get for Regionalism Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54906-9_8

  • DOI: https://doi.org/10.1007/978-3-642-54906-9_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54905-2

  • Online ISBN: 978-3-642-54906-9

  • eBook Packages: Computer Science, Computer Science (R0)