Preserving Heritage: Developing a Translation Tool for Indigenous Dialects

Authors:
Melissa Robles, System Engineering, Universidad de Los Andes & Quantil SAS, Bogotá, Colombia (ORCID: 0009-0009-1414-1107)
Cristian A. Martínez, System Engineering, Universidad de Los Andes, Bogotá, Colombia (ORCID: 0009-0000-3924-9223)
Juan C. Prieto, System Engineering, Universidad de Los Andes, Bogotá, Colombia (ORCID: 0009-0004-1257-1246)
Sara Palacios, System Engineering, Universidad de Los Andes, Bogotá, Colombia (ORCID: 0000-0001-8962-0414)
Rubén Manrique, System Engineering, Universidad de Los Andes, Bogotá, Colombia (ORCID: 0000-0001-8742-2094)

WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining, March 2024, Pages 1200–1203
https://doi.org/10.1145/3616855.3637828

Published: 04 March 2024

ABSTRACT

Indigenous languages contribute substantially to the cultural and linguistic heritage of their communities, so preserving and understanding them is crucial. Despite this value, they are threatened with extinction because the number of native speakers is dwindling and oral traditions predominate over written forms. This study aims to contribute to the conservation of these languages by developing a Spanish-indigenous language translator. It uses neural machine translation and investigates three approaches: a translation model based on transformers, fine-tuning with a Finnish translator, and fine-tuning with a multilingual translator. The results are promising and competitive with the limited existing research in this field.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Machine translation

Published in
WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining
March 2024
1246 pages
ISBN: 9798400703713
DOI: 10.1145/3616855
General Chairs: Luz Angélica Caudillo Mata (MDA Geointelligence), Silvio Lattanzi (Google Research), Andrés Muñoz Medina (Google Research)
Program Chairs: Leman Akoglu (CMU), Aristides Gionis (KTH), Sergei Vassilvitskii (Google Research)
Copyright © 2024 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
low-resource languages
natural language processing
transformer
translator
Qualifiers
- abstract
Acceptance Rates
Overall Acceptance Rate: 498 of 2,863 submissions, 17%