Abstract
Speaker verification (SV) tasks with low-resource language corpora naturally face technical difficulties and often require language mixture processing. In this paper, the LibriSpeech ASR corpus, the AISHELL-1 Mandarin speech corpus, and the Yegna2021 corpus were used to train the x-vector model. Yegna2021 is a bilingual speech corpus of Amharic and English, which we designed and collected to facilitate SV experimentation. Over 200 native Ethiopian speakers who are bilingual in both languages participated in the creation of the corpus. To the best of our knowledge, this is the first study of SV systems in the Amharic language. This study proposes that mitigating the SV performance degradation caused by language mismatch between training and testing utterances requires not only combining two or more languages for training, but also considering the phonetic similarities and differences between those languages, which affect SV performance. The varied effects of language combinations have been examined on Mandarin Chinese, Amharic, and English. We also investigate the impact of language mismatch between training and testing on SV performance using only the Yegna2021 corpus. The experimental results show that language variability between training and testing utterances significantly degrades SV performance (between 6.5% and 9.0%). The combination of Amharic and Mandarin yields better SV performance than English and Mandarin, achieving an equal error rate (EER) of 8.3% compared with 9.8%, a relative performance degradation of 17.1%. To verify these results, we paired Mandarin with data from LibriSpeech; the result shows an 18.2% relative performance degradation, with an EER of 9.9% for English and Mandarin.
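As a reading aid for the EER and relative-degradation figures quoted above, the following minimal Python sketch shows one common way such numbers are computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the threshold-sweep approximation of the EER, and the choice of the lower EER as the baseline are all assumptions made here. With the rounded EERs from the abstract (8.3% and 9.8%), this convention gives about 18.1%, so the reported 17.1% presumably reflects unrounded EER values or a slightly different definition.

```python
def relative_degradation(baseline_eer, degraded_eer):
    """Relative increase in EER, as a percentage of the baseline EER.

    Assumption: the baseline is the lower (better) EER.
    """
    return 100.0 * (degraded_eer - baseline_eer) / baseline_eer


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Returns the mean of the false acceptance rate (FAR) and false rejection
    rate (FRR) at the threshold where the two rates are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    # Toy scores, invented purely for illustration (not from any corpus).
    targets = [0.9, 0.8, 0.7, 0.4]
    nontargets = [0.6, 0.3, 0.2, 0.1]
    print(f"toy EER: {100 * equal_error_rate(targets, nontargets):.1f}%")
    # Rounded EERs quoted in the abstract.
    print(f"relative degradation: {relative_degradation(8.3, 9.8):.1f}%")
```

In practice, toolkits usually interpolate between neighbouring thresholds rather than picking the closest observed one, so values from this sketch can differ slightly from toolkit output on small trial lists.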
Acknowledgements
Thanks to the National Key R&D Program of China (No. 2020YFC2004103), National Natural Science Foundation of China (No. 61876131, U1936102), and Basic Research Project of Qinghai Science and Technology Program (No. 2021-ZJ-609).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tadele, F., Wei, J., Honda, K., Zhang, R., Yang, W. (2022). Effect of Language Mixture on Speaker Verification: An Investigation with Amharic, English, and Mandarin Chinese. In: Sun, X., Zhang, X., Xia, Z., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2022. Lecture Notes in Computer Science, vol 13340. Springer, Cham. https://doi.org/10.1007/978-3-031-06791-4_20
DOI: https://doi.org/10.1007/978-3-031-06791-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06790-7
Online ISBN: 978-3-031-06791-4
eBook Packages: Computer Science, Computer Science (R0)