Nenek: a cloud-based collaboration platform for the management of Amerindian language resources

  • Original Paper
Language Resources and Evaluation

Abstract

This article presents Nenek, a cloud-based collaboration platform for the documentation of under-resourced languages. Nenek is based on a crowdsourcing scheme that supports native speakers, indigenous associations, government agencies and researchers in creating virtual communities of minority language speakers on the Internet. Nenek includes a set of web tools that enable users to work collaboratively on language documentation tasks, build lexicographic assets and produce new language resources. The platform includes a three-stage management model that controls the acquisition of existing language resources, the manufacturing of new resources, and their distribution within the virtual community and to the general public. In the acquisition stage, existing language resources are either automatically extracted from the web by a crawler or received as donations from users who participate in a monolingual social network. In the manufacturing stage, lexicographic and collaborative tools enable users to build new resources. In the diffusion stage, the acquired and manufactured resources are published, either within the virtual community or publicly. We present a life cycle mapping scheme that registers the transformations of the language resources at each of the three stages of language resource management. This scheme also traces the utilization and diffusion of each resource produced by the virtual community. The paper includes a case study in which we present the use of the Nenek platform in a language documentation project for Huastec, a Mayan language spoken in Mexico's Gulf coast region. This case study reveals Nenek's efficiency in terms of acquisition, annotation, manufacturing and diffusion of language resources; it also discusses the participation of the members of the virtual community.
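
The three-stage model and the life cycle mapping scheme described above can be illustrated with a minimal Python sketch. It is not Nenek's implementation: the names `Stage`, `LifeCycleEvent`, `LanguageResource` and `register` are hypothetical, and the rule that a resource never moves back to an earlier stage is a simplifying assumption made only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # The three management stages named in the abstract.
    ACQUISITION = 1
    MANUFACTURING = 2
    DIFFUSION = 3


@dataclass
class LifeCycleEvent:
    stage: Stage
    action: str  # hypothetical labels, e.g. "crawled", "donated", "published"
    actor: str   # who performed the transformation


@dataclass
class LanguageResource:
    resource_id: str
    title: str
    history: list = field(default_factory=list)

    def register(self, stage: Stage, action: str, actor: str) -> None:
        """Record one transformation of the resource.

        Assumption for this sketch: a stage may repeat, but a resource
        never returns to an earlier stage.
        """
        if self.history and stage.value < self.history[-1].stage.value:
            raise ValueError("resource cannot move back to an earlier stage")
        self.history.append(LifeCycleEvent(stage, action, actor))

    def current_stage(self):
        # None until the first event has been registered.
        return self.history[-1].stage if self.history else None


if __name__ == "__main__":
    resource = LanguageResource("r1", "Huastec word list")
    resource.register(Stage.ACQUISITION, "donated", "native speaker")
    resource.register(Stage.MANUFACTURING, "annotated", "researcher")
    resource.register(Stage.DIFFUSION, "published", "virtual community")
    for event in resource.history:
        print(event.stage.name, event.action, event.actor)
```

Keeping the history as an append-only list means the full sequence of transformations stays available for tracing how each resource was used and diffused.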

Notes

  1. This is the case, for example, in Mexico, where languages such as Maya or some variants of Nahuatl are notably more present online than other minority languages. Although the online resources of these prominent minority languages are insignificant compared to the internet presence of majority languages such as English or Spanish, a considerable number of online repositories, texts, transcriptions and dictionaries are available, for example in the archives already mentioned. In contrast, for languages with fewer speakers, such as Huastec, only a few dictionaries are available.

  2. Resource accessibility is subject to restrictions that respect the decisions of depositors who want to control their resources and who require the safeguarding of sensitive ritual information or the protection of author rights. For instance, the AILLA archive contains hundreds of Huastec language resources deposited by three distinguished researchers, but 80% of them are password-protected, which reduces the number of freely available resources. The ELAR archive includes hundreds of resources about an endangered variant of the Huastec language, all deposited by a single depositor, but access to them depends on a positive response from the depositor to each request.

  3. The metadata are based on the IMDI format (http://www.mpi.nl/imdi).

  4. This makes sense when managing minority languages, since they commonly lack a formal and standardized writing system. Incorrect syntax could introduce expressions that are not in use into the lexicons and libraries. In the case of multimedia formats, incorrect cultural assets could be added to the repositories without the supervision of native speakers, researchers or indigenous associations, because it is quite common to confuse the clothing, food, and dwellings of different cultures.

  5. Traditional electronic document management systems (EDMS) enable organizations to manage documents throughout their life cycle, from creation to destruction (Adam 2008). In Nenek, by contrast, we register the transformations of the language resources through the acquisition, manufacturing and diffusion stages, and the destruction of resources is not contemplated.

  6. Our team is currently working on stemming and lemmatization mechanisms, which will be included in the lexicographical tool.

  7. The team that proposed the method for automating lexeme extraction will describe it, and report the precision evaluation of the spell checker, in a forthcoming article.

  8. This category is mainly produced by organizations and associations.

  9. Of course, the databases generated by the virtual community can be helpful for researchers to conduct corpus analysis, discourse studies of internet-forum communications, and language documentation about minority languages.

References

  • Acosta, J., Hernández, T., Martínez, C., & Acosta, N. (2013). LejkixKaw ti Tének: An online dictionary created by speakers in a collaborative manner by using Nenek platform. http://www.nenek.mx/ES/?opc=dictionary. Accessed October 15, 2015.

  • Adam, A. (2008). Implementing electronic document and record management systems. Boca Raton: Auerbach Publications.

  • Administration for Native Americans. Native languages archives preservation: A reference guide for establishing archives and repositories. Washington, DC. http://www.aihec.org/resources/documents/NativeLanguagePreservationReferenceGuide. Accessed October 15, 2015.

  • AILLA. The archive of the indigenous languages of Latin America. http://www.ailla.utexas.org/site/welcome.html. Accessed October 15, 2015.

  • Alaska Native Language Archive. https://www.uaf.edu/anla/. Accessed October 15, 2015.

  • Aspell Dictionaries. ftp://ftp.gnu.org/gnu/aspell/dict/0index.html. Accessed October 15, 2015.

  • Baroni, M., & Kilgarriff, A. (2006). Large linguistically-processed web corpora for multiple languages. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations (EACL ’06). Association for Computational Linguistics (pp. 87–90). Stroudsburg, PA, USA.

  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43, 209–226. doi:10.1007/s10579-009-9081-4.

  • BibliotecaTenek. (2013). Online library for Huastec speakers. http://www.nenek.mx/ECAD/. Accessed October 15, 2015.

  • Carretero, J., Gonzalez, J. L., & Hooft, A. (2015). Co-Tenek. A Huastec spell checker. http://www.nenek.mx/huasteco.dic. Accessed October 15, 2015.

  • Carretero, J., Scannell, K., Gonzalez, J. L., & Hooft, A. (2015). Co-Tenek. A Huastec spell checker for Mozilla. https://addons.mozilla.org/addon/huastec-spell-checker/. Accessed October 15, 2015.

  • Chang, D. (2009). TAPS: Checklist for responsible archiving of digital language resources. MA thesis, Graduate Institute of Applied Linguistics.

  • DoBes. Documentación de lenguas amenazadas. http://dobes.mpi.nl/?lang=es. Accessed September 29, 2015.

  • ELAR. Endangered languages archive. http://www.elar-archive.org/index.php. Accessed September 29, 2015.

  • Ethnologue. Languages of the world. http://www.ethnologue.com/17/. Accessed September 29, 2015.

  • Gippert, J., Himmelmann, N. P., & Mosel, U. (Eds.) (2006). Thick interfaces: Mobilizing language documentation with multimedia. In Essentials of language documentation (pp. 363–379). Berlin: Mouton de Gruyter.

  • Gonzalez, J. L., Carretero, J., Sosa-Sosa, J., Sanchez, M., & Bergua, B. (2015). SkyCDS: A resilient content delivery service based on diversified cloud storage. Simulation Modelling Practice and Theory, 54, 64–85.

  • Gonzalez, J. L., & Marcelin-Jimenez, R. (2011). Phoenix: Fault tolerant distributed web storage based on urls. Journal of Convergence, Section C: Web and Multimedia, 2(1), 79–85.

  • González, J. L., Pérez, J. C., Sosa-Sosa, V., Cardoso, J. F. R., & Marcelín-Jiménez, R. (2013). An approach for constructing private storage services as a unified fault-tolerant system. Journal of Systems and Software, 86(7), 1907–1922.

  • Good, J. (2010). Finding the linguist's place in a new technological universe. In L. A. Grenoble & N. L. Furbee (Eds.), Language documentation: Practice and values (pp. 111–132). Amsterdam: John Benjamins Publishing Company.

  • Grenoble, L. A., & Whaley, L. J. (2006). Saving languages. An introduction to language revitalization. Cambridge: Cambridge University Press.

  • Harrison, K. D. (2007). When languages die: The extinction of the world’s languages and the erosion of human knowledge. New York: Oxford University Press.

  • HD2015. Habilidades Digitales para todos. http://www.sep.gob.mx/es/sep1/habilidades_digitales_para_todos#.VibUA6dVKlN. Accessed October 15, 2015.

  • Himmelmann, N. P. (1998). Documentary and descriptive linguistics. Linguistics, 36, 161–195.

  • Himmelmann, N. P. (2006). Language documentation: What is it and what is it good for? In J. Gippert, N. P. Himmelmann, & U. Mosel (Eds.), Essentials of language documentation (pp. 1–30). Berlin: Mouton de Gruyter.

  • Hinton, L. (2001). Language revitalization: An overview. In L. Hinton & K. Hale (Eds.), The green book of language revitalization in practice (pp. 3–18). San Diego: Academic Press.

  • Hooft, A., & Gonzalez, J. L. (2014). Collaborative language documentation: The construction of the Huastec corpus. In CCURL 2014: Collaboration and computing for under-resourced languages in the linked open data era. Reykjavik, Iceland, May 26, 2014.

  • INEGI: Censo de Población y Vivienda. (2010). Instituto Nacional de Estadística, Geografía e Informática. Aguascalientes (Mexico). http://www.inegi.org.mx/est/contenidos/proyectos/ccpv/cpv2010/. Accessed October 15, 2015.

  • IWS: Internet world stats: Users by language. http://www.internetworldstats.com/stats7.htm. Accessed September 29, 2015.

  • JournalTenek. Special edition for Huastec speakers, Teczapic ITV Journal. http://www.nenek.mx/Journal

  • Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3), 333–347. doi:10.1162/089120103322711569.

  • Krauss, M. (1992). The world's languages in crisis. Language, 68, 4–10.

  • Nathan, D., & Austin, P. K. (2004). Reconceiving metadata: Language documentation through thick and thin. In P. K. Austin (Ed.), Language documentation and description (Vol. 2, pp. 179–187). London: SOAS.

  • Nenek in Facebook. https://www.facebook.com/NenekMexico. Accessed October 15, 2015.

  • Nenek in Twitter. https://twitter.com/NenekMexico. Accessed October 15, 2015.

  • Open source software for creating private and public clouds. https://www.openstack.org/

  • Pangloss, Lacito: Langues et civilisations à tradition orale. http://lacito.vjf.cnrs.fr/pangloss/index_en.htm. Accessed September 29, 2015.

  • Paolillo, J., Pimienta, D., Prado, D., et al. (2005). Measuring linguistic diversity on the internet. Montreal: UNESCO Institute for Statistics.

  • Paolillo, J., Pimienta, D., & Blanco, A. (2009). Twelve years of measuring linguistic diversity in the internet: Balance and perspectives. Paris: United Nations Educational, Scientific and Cultural Organization.

  • PARADISEC: The Pacific and regional archive for digital sources in endangered cultures. http://paradisec.org.au/home.html. Accessed September 29, 2015.

  • Rheingold, H. (1993). The virtual community. Homesteading the electronic frontier. http://www.well.com/user/hlr/vcbook/. Accessed October 15, 2015.

  • Summer Institute of Linguistics Language & Culture Archives, Huastec Section. http://www.sil.org/resources/search?query=huastec. Accessed October 15, 2015.

  • TLA: The language archive. https://tla.mpi.nl/. Accessed October 15, 2015.

  • Warschauer, M. (2001). Language, identity and the internet. In B. Kolko, L. Nakamura, & G. Rodman (Eds.), Race in cyberspace (pp. 151–170). New York: Routledge. http://motspluriels.arts.uwa.edu.au/MP1901mw.html. Accessed October 15, 2015.

Acknowledgments

We would like to thank the reviewers for their valuable feedback. The Nenek project is sponsored through a grant from the Mexican Secretary of Public Education and the National Council of Science and Technology (SEP-Conacyt research Grant CB-2012-180863). The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, Network for Sustainable Ultrascale Computing (NESUS).

Author information

Corresponding author

Correspondence to J. L. Gonzalez.

About this article

Cite this article

Gonzalez, J.L., van’t Hooft, A., Carretero, J. et al. Nenek: a cloud-based collaboration platform for the management of Amerindian language resources. Lang Resources & Evaluation 51, 897–925 (2017). https://doi.org/10.1007/s10579-016-9361-8

  • DOI: https://doi.org/10.1007/s10579-016-9361-8
