ABSTRACT
Projects such as Mozilla Common Voice were created to address the unavailability of voice data, or the high cost of the data that is available, for speech technology such as Automatic Speech Recognition (ASR) research and application development. The pilot detailed in this paper creates a large, freely licensed public repository of transcribed speech in the Odia language, as no such repository was known to be available. The strategy and methodology behind this process are based on the OpenSpeaks project. Licensed under a Public Domain Dedication (CC0 1.0), the repository currently includes audio recordings of pronunciations for more than 55,000 unique words in Odia, including more than 5,600 recordings of words in the northern Odia dialect Baleswari. The author found no public listing of words in this dialect prior to this pilot. This repository is arguably the most extensive transcribed speech corpus in Odia that is publicly available under a free and open license. This paper details the strategy, approach, and process behind building both the text and the speech corpus using open source tools such as Lingua Libre, an approach that can help in building text and speech data for other low- and medium-resource languages.