
Improving Tone Recognition Performance using Wav2vec 2.0-Based Learned Representation in Yoruba, a Low-Resourced Language

Published: 23 November 2024

Abstract

Many sub-Saharan African languages are tone languages, and most are classified as low-resource languages because few resources and tools exist to process them. Identifying the tone associated with a syllable is therefore a key challenge for speech recognition in these languages. We propose models that automate tone recognition in continuous speech and can easily be incorporated into a speech recognition pipeline for these languages. We investigated several neural architectures as well as several speech feature extraction algorithms: filter banks (FBs), the learnable frontend LEAF, the cepstrogram (CS), and Mel-frequency cepstral coefficients (MFCCs). Given the low-resource setting, we also evaluated Wav2vec 2.0 (W2V) models for this task, using a public Yoruba speech recognition dataset. Combining features obtained from CS and FBs yields a minimum Tone Error Rate (TER) of 19.54%, while the W2V-based models reach a TER of 17.72%, demonstrating that W2V outperforms the models previously reported in the literature for tone identification in low-resource languages.
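The TER reported above is an edit-distance metric over predicted tone-label sequences, computed analogously to the word error rate. The sketch below is an illustrative reconstruction, not the authors' code: it assumes Yoruba's three level tones are encoded as the labels H (high), M (mid), and L (low), and computes TER as Levenshtein distance normalized by reference length.

```python
def levenshtein(ref, hyp):
    """Edit distance (substitutions + insertions + deletions) via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

def tone_error_rate(ref, hyp):
    """TER = edit distance between tone sequences / reference sequence length."""
    return levenshtein(ref, hyp) / len(ref)

# Hypothetical example: one substitution (M->L) and one inserted M.
ref = list("HMLHM")
hyp = list("HLLHMM")
print(tone_error_rate(ref, hyp))  # 2 errors / 5 reference tones = 0.4
```

A pipeline trained with a CTC objective would first collapse repeated frame-level predictions and remove blanks before scoring the resulting tone sequence this way.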


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 12
December 2024, 237 pages
EISSN: 2375-4702
DOI: 10.1145/3613720

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 November 2024
Online AM: 30 August 2024
Accepted: 18 August 2024
Revised: 13 June 2024
Received: 13 January 2024
Published in TALLIP Volume 23, Issue 12


Author Tags

  1. Speech processing
  2. tone recognition
  3. low resource
  4. Wav2vec 2.0

Qualifiers

  • Short-paper

