Effect of Language Mixture on Speaker Verification: An Investigation with Amharic, English, and Mandarin Chinese

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13340)

Abstract

Speaker verification (SV) with low-resource language corpora faces inherent technical difficulties and often requires mixing languages in the training data. In this paper, the LibriSpeech ASR corpus, the AISHELL-1 Mandarin speech corpus, and the Yegna2021 corpus were used to train the x-vector model. Yegna2021 is a bilingual speech corpus of Amharic and English, which we designed and collected to facilitate SV experimentation. Over 200 native Ethiopian speakers who are bilingual in both languages participated in its creation. To the best of our knowledge, this is the first study of SV systems for the Amharic language. This study proposes that mitigating the SV performance degradation caused by language mismatch between training and testing utterances requires not only combining two or more languages for training, but also considering the phonetic similarities and differences between those languages, which affect SV performance. The varied effects of language combinations were examined for Mandarin Chinese, Amharic, and English. We also investigate the impact of language mismatch between training and testing on SV performance using only the Yegna2021 corpus. The experimental results show that language variability between training and testing utterances significantly degrades SV performance (between 6.5% and 9.0%). The combination of Amharic and Mandarin yields better SV performance than English and Mandarin, achieving an equal error rate (EER) of 8.3% compared with 9.8%, with a relative performance degradation of 17.1%. To verify these results, we paired Mandarin with data from LibriSpeech; the result shows an 18.2% relative performance degradation, with an EER of 9.9% for English and Mandarin.
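
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how an EER and a relative degradation are commonly computed. It is not the authors' evaluation code: the function names, the threshold sweep, and the toy scores are assumptions made for illustration, and the relative-change formula (difference divided by the baseline EER) is one common definition that the abstract does not spell out.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Hypothetical helper: a trial is accepted when its score >= threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a genuine-speaker trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_degradation(baseline_eer, degraded_eer):
    """Relative EER increase with respect to the baseline (assumed definition)."""
    return (degraded_eer - baseline_eer) / baseline_eer


# Toy scores, invented for illustration only (not data from the paper).
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)

# Rounded EERs quoted in the abstract: 8.3% (Amharic + Mandarin), 9.8% (English + Mandarin).
rounded_change = relative_degradation(8.3, 9.8)
```

With the rounded EERs from the abstract, this definition gives roughly 18.1%, slightly different from the reported 17.1%; the gap is presumably due to the paper using unrounded EER values, although the abstract does not say so.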

Acknowledgements

Thanks to the National Key R&D Program of China (No. 2020YFC2004103), National Natural Science Foundation of China (No. 61876131, U1936102), and Basic Research Project of Qinghai Science and Technology Program (No. 2021-ZJ-609).

Corresponding author

Correspondence to Jianguo Wei.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tadele, F., Wei, J., Honda, K., Zhang, R., Yang, W. (2022). Effect of Language Mixture on Speaker Verification: An Investigation with Amharic, English, and Mandarin Chinese. In: Sun, X., Zhang, X., Xia, Z., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2022. Lecture Notes in Computer Science, vol 13340. Springer, Cham. https://doi.org/10.1007/978-3-031-06791-4_20

  • DOI: https://doi.org/10.1007/978-3-031-06791-4_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06790-7

  • Online ISBN: 978-3-031-06791-4

  • eBook Packages: Computer Science, Computer Science (R0)
