Processing Quechua and Guarani Historical Texts: Query Expansion at Character and Word Level for Information Retrieval

Cordova, Johanna; Boidin, Capucine; Itier, César; Moreaux, Marie-Anne; Nouvel, Damien

doi:10.1007/978-3-030-11680-4_20

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 898))

Included in the following conference series:

Annual International Symposium on Information Management and Big Data

Abstract

The LANGAS project provides an online database of historical texts (16th–19th centuries) in Quechua, Guarani and Tupi for sociolinguistic studies. Querying texts in such low-resource languages raises several challenges. Among them, our work addresses word variation (diacritics, typographic variation) through an optional query expansion mechanism in the search engine. Such processing must take the peculiarities of the languages into account. This paper describes the morphology of these languages, the linguistic resources collected, the modules implemented (regular expressions, stemming, word clusters) and some preliminary evaluations. This work is also an opportunity to release resources for these languages. We plan to extend it in the near future and hope it will be useful to other researchers working on these languages.
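To make the idea of character-level query expansion concrete, here is a minimal Python sketch of how a query word might be turned into a regular expression that tolerates diacritic and spelling variation. It is an illustration only, not the implementation described in the paper: the equivalence classes in `VARIANT_CLASSES` and the function names are hypothetical, chosen to show the technique.

```python
import re
import unicodedata

# Hypothetical equivalence classes, for illustration only.
# They are NOT the rules used by the LANGAS search engine.
VARIANT_CLASSES = [
    ("i", "e"),
    ("u", "o"),
    ("k", "c", "q"),
]


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (á -> a, ñ -> n)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_variant_pattern(query):
    """Compile a regex in which each letter may be replaced by its variants."""
    lookup = {}
    for group in VARIANT_CLASSES:
        char_class = "[" + "".join(group) + "]"
        for ch in group:
            lookup[ch] = char_class
    base = strip_diacritics(query.lower())
    parts = [lookup.get(ch, re.escape(ch)) for ch in base]
    return re.compile("".join(parts))


def matches(pattern, word):
    """Test a corpus word against the pattern, ignoring case and diacritics."""
    return pattern.fullmatch(strip_diacritics(word.lower())) is not None


pattern = build_variant_pattern("kuti")
candidates = ["cuti", "Cotí", "kutin", "runa"]
hits = [w for w in candidates if matches(pattern, w)]
```

Under these assumed classes, the query `kuti` retrieves `cuti` and `Cotí` but not the longer form `kutin`, which would call for word-level processing such as stemming. Stripping every combining mark is a deliberately crude choice: it also merges letters such as `ñ` and `n`, which a real system would probably want to keep distinct.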

Notes

1. http://www.langas.cnrs.fr/#/recherche_corpus.
2. Unix system’s Spanish dictionary.
3. http://unitexgramlab.org.
4. Dictionary of the Ayacucho dialect provided by the Peruvian Ministry of Education.

Acknowledgments

The LANGAS project was funded by the French National Research Agency (ANR). This work benefited from the support of Université Sorbonne Paris Cité (USPC) and National Institute for Oriental Languages and Civilizations (INALCO). Many thanks to Joséphine Castaing and Elégant Mateus who developed the site and database.

Author information

Authors and Affiliations

INALCO ERTIM, 2 rue de Lille, 75007, Paris, France
Johanna Cordova, Marie-Anne Moreaux & Damien Nouvel
Paris 3 IHEAL, 28 Rue Saint-Guillaume, 75007, Paris, France
Johanna Cordova & Capucine Boidin
INALCO CERLOM, 2 rue de Lille, 75007, Paris, France
César Itier

Corresponding author

Correspondence to Johanna Cordova.

Editor information

Editors and Affiliations

Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
Juan Antonio Lossio-Ventura
Fondazione Bruno Kessler, Trento, Italy
Denisse Muñante
Facultad de Ingeniería, University of the Pacific, Jesús María, Lima, Peru
Hugo Alatrista-Salas

Appendices

A Quechua Suffixes Chains

B Unitex Graphs

See Figs. 1 and 2.

About this paper

Cite this paper

Cordova, J., Boidin, C., Itier, C., Moreaux, MA., Nouvel, D. (2019). Processing Quechua and Guarani Historical Texts Query Expansion at Character and Word Level for Information Retrieval. In: Lossio-Ventura, J., Muñante, D., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2018. Communications in Computer and Information Science, vol 898. Springer, Cham. https://doi.org/10.1007/978-3-030-11680-4_20

DOI: https://doi.org/10.1007/978-3-030-11680-4_20
Published: 08 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11679-8
Online ISBN: 978-3-030-11680-4
eBook Packages: Computer Science, Computer Science (R0)
