Automatic Detection of Regional Words for Pan-Hispanic Spanish on Twitter

Jimenez, Sergio; Dueñas, George; Gelbukh, Alexander; Rodriguez-Diaz, Carlos A.; Mancera, Sergio

doi:10.1007/978-3-030-03928-8_33


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11238)

Included in the following conference series:

Ibero-American Conference on Artificial Intelligence

Abstract

Languages such as Spanish, spoken by hundreds of millions of people across large geographic areas, are subject to a high degree of regional variation. Regional words are frequently used in informal contexts, but their meaning is shared only by a relatively small group of people. Dealing with these regionalisms is a challenge for most applications in the field of Natural Language Processing. We propose a novel method to identify regional words and provide their meaning based on a large corpus of geolocated ‘tweets’. The method combines the notions of specificity (tf-idf), spatial correlation (HSIC) and neural word embeddings (word2vec) to produce a list of words ranked by their degree of regionalism, along with their meaning represented by a set of semantically related words and examples of use. The method was evaluated against lists of regional words taken from regional dictionaries produced by lexicographers and from collaborative websites where users freely contribute regional words. We tested the effectiveness of the proposed method and produced a new resource for 21 Spanish-speaking countries composed of 5,000 regional words per country along with similar words and example ‘tweets’.
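The following Python sketch illustrates two of the ingredients named in the abstract on toy data. It is not the authors' implementation: the exact tf-idf weighting, the kernels, and the way the scores are combined are not given here, so the sketch assumes that each country is treated as a single "document" for tf-idf, and that HSIC is computed with the biased estimate trace(KHLH)/n² using Gaussian kernels over tweet coordinates and a binary word-usage indicator. The function names, bandwidths, and data are hypothetical, and the word2vec step is omitted.

```python
import math
from collections import Counter


def tfidf_by_country(tweets):
    """Score words per country, treating each country as one 'document'.

    `tweets` is an iterable of (country, text) pairs (an assumed input format).
    """
    tf = {}
    for country, text in tweets:
        tf.setdefault(country, Counter()).update(text.lower().split())
    n_countries = len(tf)
    # Document frequency: in how many countries each word appears.
    df = Counter(word for counts in tf.values() for word in counts)
    return {
        country: {w: f * math.log(n_countries / df[w]) for w, f in counts.items()}
        for country, counts in tf.items()
    }


def gaussian_gram(points, sigma):
    """Gram matrix of a Gaussian kernel over a list of equal-length tuples."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))
         for q in points]
        for p in points
    ]


def center(gram):
    """Double-center a symmetric Gram matrix (H K H with H = I - 11^T / n)."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    total_mean = sum(row_mean) / n
    return [
        [gram[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(locations, usage, sigma_loc=1.0, sigma_use=1.0):
    """Biased HSIC estimate between tweet locations and a word-usage indicator."""
    n = len(locations)
    k_centered = center(gaussian_gram(locations, sigma_loc))
    l_gram = gaussian_gram([(u,) for u in usage], sigma_use)
    # trace(K H L H) equals the elementwise sum of (H K H) * L for symmetric L.
    return sum(
        k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n)
    ) / n ** 2


# Toy data (invented for illustration only).
tweets = [
    ("CO", "ajiaco con pollo"),
    ("MX", "tacos con pollo"),
    ("AR", "asado con amigos"),
]
scores = tfidf_by_country(tweets)

# Two spatial clusters of four tweets each.
locations = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
regional_usage = [1, 1, 1, 1, 0, 0, 0, 0]  # word used only in the first cluster
general_usage = [1, 0, 1, 0, 1, 0, 1, 0]   # word used in both clusters
hsic_regional = hsic(locations, regional_usage)
hsic_general = hsic(locations, general_usage)
print(scores["CO"], hsic_regional, hsic_general)
```

In this toy setting a word that appears in every country receives a tf-idf score of zero, and the word confined to one spatial cluster obtains a larger HSIC value than the word spread over both clusters. How such scores would be turned into a single regionalism ranking is left open here.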

Supported by Asociación de Amigos del Instituto Caro y Cuervo. S. Mancera was supported by a scholarship given by CONACYT, Mexico.

Notes

1. For example, in some regions of Colombia, the word galería refers to a marketplace, but in general Spanish, that word means an art gallery or a covered path.
2. For example, in Colombia, ajiaco refers to a type of soup particular to that country.
3. https://www.datos.gov.co/browse?q=F-TWITTER.
4. https://www.datos.gov.co/browse?q=word2vec.
5. https://www.datos.gov.co/browse?q=regionalismos%20cercanas.
6. https://www.datos.gov.co/browse?q=regionalismos%20ejemplos.
7. https://github.com/sgjimenezv/spanish_regional_words_benchmark.

Author information

Authors and Affiliations

Instituto Caro y Cuervo, Bogotá D.C., Colombia
Sergio Jimenez, George Dueñas, Carlos A. Rodriguez-Diaz & Sergio Mancera
Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh & Sergio Mancera

Corresponding author

Correspondence to Sergio Jimenez.

Editor information

Editors and Affiliations

Universidad Nacional del Sur, Bahía Blanca, Buenos Aires, Argentina
Guillermo R. Simari
University of Madeira, Funchal, Portugal
Eduardo Fermé
Universidad Nacional de Piura, Castilla-Piura, Peru
Flabio Gutiérrez Segura
Universidad Nacional de Trujillo, Trujillo, Peru
José Antonio Rodríguez Melquiades

About this paper

Cite this paper

Jimenez, S., Dueñas, G., Gelbukh, A., Rodriguez-Diaz, C.A., Mancera, S. (2018). Automatic Detection of Regional Words for Pan-Hispanic Spanish on Twitter. In: Simari, G., Fermé, E., Gutiérrez Segura, F., Rodríguez Melquiades, J. (eds) Advances in Artificial Intelligence - IBERAMIA 2018. IBERAMIA 2018. Lecture Notes in Computer Science, vol 11238. Springer, Cham. https://doi.org/10.1007/978-3-030-03928-8_33

DOI: https://doi.org/10.1007/978-3-030-03928-8_33
Published: 09 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03927-1
Online ISBN: 978-3-030-03928-8
eBook Packages: Computer Science, Computer Science (R0)
