Processing Quechua and Guarani Historical Texts Query Expansion at Character and Word Level for Information Retrieval

  • Conference paper
Information Management and Big Data (SIMBig 2018)

Abstract

The LANGAS project provides an online database of historical texts (16th–19th centuries) in Quechua, Guarani and Tupi for sociolinguistic studies. Querying texts in such low-resource languages raises several issues and challenges. Among them, our work addresses word variation (diacritization, typographic variation), which we handle through an optional query expansion mechanism in the search engine. Such processing must take into account the peculiarities of the languages considered. This paper describes the morphology of these languages, the linguistic resources collected, the modules implemented (regular expressions, stemming, word clusters) and some preliminary evaluations. Our work is also an opportunity to release resources for these languages. We plan to extend this work in the near future and hope it will be useful to other researchers interested in the topic.
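To make the idea of character-level query expansion concrete, here is a minimal Python sketch under stated assumptions. It is not the LANGAS implementation: the function names, the letter classes in `VARIANT_CLASSES` and the sample strings are hypothetical and chosen only for illustration. A query term is stripped of its diacritics and turned into a regular expression in which every letter may carry any combining mark and may be replaced by a letter from the same illustrative class.

```python
import re
import unicodedata

# Range of the Unicode "Combining Diacritical Marks" block.
MARKS = "\u0300-\u036f"

# Hypothetical classes of interchangeable letters (illustration only,
# not the rules used by the LANGAS search engine).
VARIANT_CLASSES = ["iy", "uv", "ck"]


def strip_diacritics(text: str) -> str:
    """Return text with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_to_regex(term: str) -> re.Pattern:
    """Build a pattern matching diacritic and letter variants of term."""
    parts = []
    for ch in strip_diacritics(term.lower()):
        # Use the whole class if the letter belongs to one, else the letter.
        alternatives = next((cls for cls in VARIANT_CLASSES if ch in cls), ch)
        parts.append("[" + re.escape(alternatives) + "][" + MARKS + "]*")
    # Lookarounds act as word boundaries that also treat marks as word parts.
    boundary_before = "(?<![\\w" + MARKS + "])"
    boundary_after = "(?![\\w" + MARKS + "])"
    return re.compile(boundary_before + "".join(parts) + boundary_after,
                      re.IGNORECASE)


def search(term: str, text: str) -> list[str]:
    """Return every variant of term found in text, in composed (NFC) form."""
    pattern = expand_to_regex(term)
    hits = pattern.findall(unicodedata.normalize("NFD", text))
    return [unicodedata.normalize("NFC", hit) for hit in hits]


if __name__ == "__main__":
    # Invented sample line mixing spellings of one word.
    print(search("tupã", "Tupa, tupã and Tûpa; not tupasy"))
```

Matching is done on decomposed (NFD) text so that a base letter and its accent are separate code points; this keeps the pattern short, at the cost of normalizing every document at query time or at indexing time.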

Notes

  1. http://www.langas.cnrs.fr/#/recherche_corpus.

  2. Unix system’s Spanish dictionary.

  3. http://unitexgramlab.org.

  4. Dictionary of the Ayacucho dialect provided by the Peruvian Ministry of Education.

Acknowledgments

The LANGAS project was funded by the French National Research Agency (ANR). This work benefited from the support of Université Sorbonne Paris Cité (USPC) and the National Institute for Oriental Languages and Civilizations (INALCO). Many thanks to Joséphine Castaing and Elégant Mateus, who developed the site and database.

Author information

Corresponding author

Correspondence to Johanna Cordova.

Appendices

A Quechua Suffixes Chains

[Figure a: Quechua suffix chains]

B Unitex Graphs

See Figs. 1 and 2.

Fig. 1. Noun graph

Fig. 2. Derivators subgraph

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Cordova, J., Boidin, C., Itier, C., Moreaux, MA., Nouvel, D. (2019). Processing Quechua and Guarani Historical Texts Query Expansion at Character and Word Level for Information Retrieval. In: Lossio-Ventura, J., Muñante, D., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2018. Communications in Computer and Information Science, vol 898. Springer, Cham. https://doi.org/10.1007/978-3-030-11680-4_20

  • DOI: https://doi.org/10.1007/978-3-030-11680-4_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11679-8

  • Online ISBN: 978-3-030-11680-4

  • eBook Packages: Computer Science, Computer Science (R0)
