
Improving Tone Recognition Performance using Wav2vec 2.0-Based Learned Representation in Yoruba, a Low-Resourced Language

Published: 23 November 2024

Abstract

Many sub-Saharan African languages are tone languages, and most are classified as low-resource languages because few resources and tools exist to process them. Identifying the tone associated with a syllable is therefore a key challenge for speech recognition in these languages. We propose models that automate tone recognition in continuous speech and can easily be incorporated into a speech recognition pipeline for these languages. We investigated several neural architectures as well as several speech feature extraction algorithms: filter banks (FBs), the learnable frontend LEAF, the cepstrogram (CS), and Mel-frequency cepstral coefficients (MFCCs). Given the low-resource setting, we also evaluated Wav2vec 2.0 (W2V) models for this task, using a public Yoruba speech recognition dataset. Combining features obtained from CS and FBs yields a minimum Tone Error Rate (TER) of 19.54%, while the W2V-based models reach a TER of 17.72%, demonstrating that W2V outperforms the models previously reported in the literature for tone identification in low-resource languages.
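The TER reported above is an edit-distance metric over predicted tone-label sequences, computed analogously to the word error rate. The sketch below is an illustrative reconstruction, not the authors' code: it assumes Yoruba's three level tones are encoded as the labels H (high), M (mid), and L (low), and computes TER as Levenshtein distance normalized by reference length.

```python
def levenshtein(ref, hyp):
    """Edit distance (substitutions + insertions + deletions) via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

def tone_error_rate(ref, hyp):
    """TER = edit distance between tone sequences / reference sequence length."""
    return levenshtein(ref, hyp) / len(ref)

# Hypothetical example: one substitution (M->L) and one inserted M.
ref = list("HMLHM")
hyp = list("HLLHMM")
print(tone_error_rate(ref, hyp))  # 2 errors / 5 reference tones = 0.4
```

A pipeline trained with a CTC objective would first collapse repeated frame-level predictions and remove blanks before scoring the resulting tone sequence this way.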


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 12
December 2024, 237 pages
EISSN: 2375-4702
DOI: 10.1145/3613720

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 November 2024
Online AM: 30 August 2024
Accepted: 18 August 2024
Revised: 13 June 2024
Received: 13 January 2024
Published in TALLIP Volume 23, Issue 12


Author Tags

  1. Speech processing
  2. tone recognition
  3. low resource
  4. Wav2vec 2.0

Qualifiers

  • Short-paper

