Building a Public Domain Voice Database for Odia

Author:
Subhashish Panigrahi

OpenSpeaks, O Foundation, India


WWW '22: Companion Proceedings of the Web Conference 2022, April 2022, Pages 1331–1338. https://doi.org/10.1145/3487553.3524931

Published: 16 August 2022

ABSTRACT

Projects like Mozilla Common Voice were created to address the unavailability, or high cost, of voice data for speech technology such as Automatic Speech Recognition (ASR) research and application development. The pilot detailed in this paper aims to create a large, freely licensed public repository of transcribed speech in the Odia language, as no such repository was known to be available. The strategy and methodology behind this process are based on the OpenSpeaks project. Licensed under a Public Domain Dedication (CC0 1.0), the repository currently includes audio recordings of pronunciations for more than 55,000 unique words in Odia, including more than 5,600 recordings of words in the northern Odia dialect Baleswari. The author found no public listing of words in this dialect prior to this pilot. This repository is arguably the most extensive transcribed speech corpus in Odia that is publicly available under a free and open license. This paper details the strategy, approach, and process behind building both the text and the speech corpus using open source tools such as Lingua Libre, which can be helpful in building text and speech data for other low- and medium-resource languages.

Index Terms

Building a Public Domain Voice Database for Odia
1. Information systems
  1. Information systems applications
    1. Multimedia information systems
      1. Multimedia databases


Published in
WWW '22: Companion Proceedings of the Web Conference 2022
April 2022
1338 pages
ISBN:9781450391306
DOI:10.1145/3487553
Editors: Frédérique Laforest (INSA Lyon, France), Raphaël Troncy (EURECOM, France), Lionel Médini (Université Lyon 1, France), Ivan Herman (W3C / retired)
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Baleswari Odia
Odia
Wikidata
lexeme
low-resource languages
speech corpus
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%