Abstract
The LANGAS project provides an online database of historical (16th–19th century) texts in Quechua, Guarani and Tupi for sociolinguistic studies. Querying texts in such low-resourced languages raises several issues and challenges. Among them, our work addresses word variation (diacritization, typographic variations) through an optional query expansion mechanism in the search engine. Such processing must take into account the peculiarities of the languages considered. This paper describes the morphology of these languages, the linguistic resources collected, the modules implemented (regular expressions, stemming, word clusters) and some preliminary evaluations. Our work will be an opportunity to release resources for those languages. We plan to extend this work in the near future and hope it will be useful to other researchers interested in the matter.
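As an illustration of character-level query expansion, the sketch below turns a query term into a regular expression that also matches diacritic and typographic variants. It is a minimal example under stated assumptions, not the LANGAS implementation: the `VARIANT_CLASSES` table and the function names `strip_diacritics`, `expand_query` and `matches` are hypothetical, and the spelling alternations listed are only illustrative.

```python
import re
import unicodedata

# Hypothetical classes of interchangeable spellings. They are illustrative
# assumptions, not the rules used by the LANGAS search engine.
VARIANT_CLASSES = [
    ("k", "c", "qu"),
    ("w", "hu"),
    ("i", "e"),
    ("u", "o"),
]


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_query(term):
    """Compile a regex matching the term and its assumed spelling variants."""
    base = strip_diacritics(term.lower())
    # Map every spelling to the alternation of its whole class.
    lookup = {}
    for cls in VARIANT_CLASSES:
        options = sorted(cls, key=len, reverse=True)
        alternation = "(?:" + "|".join(options) + ")"
        for variant in cls:
            lookup[variant] = alternation
    # Try longer spellings first so that "qu" wins over "u".
    keys = sorted(lookup, key=len, reverse=True)
    parts = []
    i = 0
    while i < len(base):
        for key in keys:
            if base.startswith(key, i):
                parts.append(lookup[key])
                i += len(key)
                break
        else:
            # No variant class applies: match the character literally.
            parts.append(re.escape(base[i]))
            i += 1
    return re.compile("".join(parts))


def matches(term, word):
    """True if the word is an assumed variant of the query term."""
    normalized = strip_diacritics(word.lower())
    return expand_query(term).fullmatch(normalized) is not None
```

With this table, a query for `wasi` would also retrieve the spelling `huasi`, and `killa` would retrieve `quilla`. Stripping every diacritic is a deliberately coarse choice that favours recall; a real system would likely restrict it to marks that vary in the sources.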
Notes
- 1.
- 2. Unix system’s Spanish dictionary.
- 3.
- 4. Dictionary of the Ayacucho dialect provided by the Peruvian Ministry of Education.
Acknowledgments
The LANGAS project was funded by the French National Research Agency (ANR). This work benefited from the support of Université Sorbonne Paris Cité (USPC) and the National Institute for Oriental Languages and Civilizations (INALCO). Many thanks to Joséphine Castaing and Elégant Mateus, who developed the site and database.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Cordova, J., Boidin, C., Itier, C., Moreaux, MA., Nouvel, D. (2019). Processing Quechua and Guarani Historical Texts Query Expansion at Character and Word Level for Information Retrieval. In: Lossio-Ventura, J., Muñante, D., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2018. Communications in Computer and Information Science, vol 898. Springer, Cham. https://doi.org/10.1007/978-3-030-11680-4_20
DOI: https://doi.org/10.1007/978-3-030-11680-4_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11679-8
Online ISBN: 978-3-030-11680-4
eBook Packages: Computer Science, Computer Science (R0)