DOI: 10.1145/3487553.3524931
research-article

Building a Public Domain Voice Database for Odia

Published: 16 August 2022

ABSTRACT

Projects like Mozilla Common Voice were created to address two obstacles facing speech technology, such as Automatic Speech Recognition (ASR) research and application development: voice data is often unavailable, and the data that is available is often costly. The pilot detailed in this paper creates a large, freely licensed public repository of transcribed speech in the Odia language, as no such repository was known to be available. The strategy and methodology are based on the OpenSpeaks project. Licensed under a Public Domain Dedication (CC0 1.0), the repository currently includes audio recordings of pronunciations for more than 55,000 unique words in Odia, including more than 5,600 recordings of words in Baleswari, a northern dialect of Odia. The author found no public listing of words in this dialect prior to this pilot. The repository is arguably the most extensive transcribed speech corpus in Odia that is publicly available under a free and open license. This paper details the strategy, approach, and process behind building both the text and the speech corpus using many open source tools such as Lingua Libre; the same approach can help in building text and speech data for other low- and medium-resource languages.

    • Published in
      WWW '22: Companion Proceedings of the Web Conference 2022
      April 2022
      1338 pages
      ISBN: 9781450391306
      DOI: 10.1145/3487553

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
