Abstract
Spoken language diarization (LD) is the task of automatically extracting the monolingual segments present in a given code-switched utterance. In a bilingual code-switched scenario where a single speaker speaks both languages, the phoneme production of the secondary language is usually biased towards the primary language, leading to acoustic similarity between the two. The turn durations of the primary language are also typically much longer than those of the secondary language, leading to data imbalance. Because of this acoustic similarity and data imbalance, the performance of existing approaches is biased towards the primary language. The influence of acoustic similarity can be reduced by capturing supra-segmental language-specific information. Similarly, the influence of data imbalance can be suppressed by capturing language-specific information beforehand through a pre-training framework. Therefore, this work proposes a wav2vec2-based self-supervised pre-training framework to capture supra-segmental language-specific information. The results show that the proposed framework provides a relative improvement of \(33.4\%\) in terms of Jaccard error rate (JER) over the available deep-speech2-based baseline approach. The improvement in JER suggests that the proposed approach resolves the performance bias issue to some extent.
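To make the reported metric concrete, the following is a minimal sketch of a frame-level Jaccard error rate and of the relative-improvement arithmetic. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the frame-level formulation, the fixed label set (no label mapping step), and the toy label sequences are all hypothetical.

```python
def jaccard_error_rate(ref, hyp):
    """Frame-level JER in percent, averaged over the languages present in `ref`.

    Assumption: `ref` and `hyp` are equal-length sequences of per-frame
    language labels. For each language, the error is 1 - |intersection| / |union|
    of the frames assigned to that language by the reference and the hypothesis.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    languages = sorted(set(ref))
    if not languages:
        raise ValueError("ref must contain at least one frame")
    errors = []
    for lang in languages:
        inter = sum(1 for r, h in zip(ref, hyp) if r == lang and h == lang)
        union = sum(1 for r, h in zip(ref, hyp) if r == lang or h == lang)
        errors.append(1.0 - inter / union)
    return 100.0 * sum(errors) / len(errors)


def relative_improvement(baseline, proposed):
    """Relative reduction of an error metric, in percent of the baseline value."""
    return 100.0 * (baseline - proposed) / baseline


# Toy example (hypothetical labels): "P" = primary, "S" = secondary language.
ref = list("PPPPPPSSPP")            # 8 primary frames, 2 secondary frames
hyp_biased = ["P"] * len(ref)       # a system that always predicts the primary language
hyp_better = list("PPPPPPPSPP")     # a system that recovers part of the secondary segment

jer_biased = jaccard_error_rate(ref, hyp_biased)
jer_better = jaccard_error_rate(ref, hyp_better)
print(f"JER (always primary): {jer_biased:.2f}%")
print(f"JER (better system):  {jer_better:.2f}%")
print(f"Relative improvement: {relative_improvement(jer_biased, jer_better):.2f}%")
```

In this toy case the always-primary hypothesis labels 80% of the frames correctly yet scores a JER of 60%, because each language contributes equally to the average. This is why a per-language metric such as JER exposes a bias towards the primary language that frame accuracy would hide.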
References
Adiga, N., Prasanna, S.R.M.: Detection of glottal activity using different attributes of source information. IEEE Signal Process. Lett. 22(11), 2107–2111 (2015)
Al-Stouhi, S., Reddy, C.K.: Transfer learning for class imbalance problems with inadequate data. Knowl. Inf. Syst. 48(1), 201–228 (2016)
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460 (2020)
Barras, C., Le, V.B., Gauvain, J.L.: Vocapia-limsi system for 2020 shared task on code-switched spoken language identification. In: The First Workshop on Speech Technologies for Code-Switching in Multilingual Communities (2020)
Dey, S., Saha, G., Sahidullah, M.: Cross-corpora language recognition: A preliminary investigation with Indian languages. In: 2021 29th European Signal Processing Conference (EUSIPCO), pp. 546–550. IEEE (2021)
Diwan, A., Vaideeswaran, R., Shah, S., Singh, A., Raghavan, S., Khare, S., Unni, V., Vyas, S., Rajpuria, A., Yarra, C., Mittal, A., Ghosh, P.K., Jyothi, P., Bali, K., Seshadri, V., Sitaram, S., Bharadwaj, S., Nanavati, J., Nanavati, R., Sankaranarayanan, K., Seeram, T., Abraham, B.: Multilingual and code-switching asr challenges for low resource Indian languages. In: Proceedings of Interspeech (2021)
Gupta, A., Chadha, H.S., Shah, P., Chimmwal, N., Dhuriya, A., Gaur, R., Raghavan, V.: Clsril-23: cross lingual speech representations for indic languages (2021). arXiv preprint arXiv:2107.07402
Jati, A., Georgiou, P.: Neural predictive coding using convolutional neural networks toward unsupervised learning of speaker characteristics. IEEE/ACM Trans. Audio, Speech, Lang. Process. 27(10), 1577–1589 (2019)
Jelil, S., Das, R.K., Prasanna, S.R.M., Sinha, R.: Spoof detection using source, instantaneous frequency and cepstral features. In: Interspeech, pp. 22–26 (2017)
Krawczyk, B.: Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 5(4), 221–232 (2016). https://doi.org/10.1007/s13748-016-0094-0
Liu, H., Perera, L.P.G., Zhang, X., Dauwels, J., Khong, A.W., Khudanpur, S., Styles, S.J.: End-to-end language diarization for bilingual code-switching speech. In: 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021, vol. 2. International Speech Communication Association (2021)
Lyu, D.C., Chng, E.S., Li, H.: Language diarization for code-switch conversational speech. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7314–7318. IEEE (2013)
Lyu, D.C., Chng, E.S., Li, H.: Language diarization for conversational code-switch speech with pronunciation dictionary adaptation. In: 2013 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), pp. 147–150. IEEE (2013)
Mary, L., Yegnanarayana, B.: Extraction and representation of prosodic features for language and speaker recognition. Speech Commun. 50(10), 782–796 (2008)
Mishra, J., Agarwal, A., Prasanna, S.R.M.: Spoken language diarization using an attention based neural network. In: 2021 National Conference on Communications (NCC), pp. 1–6. IEEE (2021)
Mishra, J., Gandra, J., Patil, V., Prasanna, S.M.: Issues in sub-utterance level language identification in a code switched bilingual scenario. In: 2022 IEEE International Conference on Signal Processing and Communications (SPCOM), pp. 1–5. IEEE (2022)
Mishra, J., Prasanna, S.R.M.: Language vs speaker change: a comparative study (2022). arXiv preprint arXiv:2203.02680
Mori, K., Nakagawa, S.: Speaker change detection and speaker clustering using vq distortion for broadcast news speech recognition. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 1, pp. 413–416. IEEE (2001)
Park, T.J., Kanda, N., Dimitriadis, D., Han, K.J., Watanabe, S., Narayanan, S.: A review of speaker diarization: recent advances with deep learning. Comput. Speech & Lang. 72, 101317 (2022)
Prasanna, S.R.M., Gupta, C.S., Yegnanarayana, B.: Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Commun. 48(10), 1243–1261 (2006)
Rangan, P., Teki, S., Misra, H.: Exploiting spectral augmentation for code-switched spoken language identification (2020). arXiv preprint arXiv:2010.07130
Shah, S., Sitaram, S., Mehta, R.: First workshop on speech processing for code-switching in multilingual communities: shared task on code-switched spoken language identification. WSTCSMC 2020, 24 (2020)
Sitaram, S., Chandu, K.R., Rallabandi, S.K., Black, A.W.: A survey of code switching speech and language processing (2019). arXiv:1904.00784 [cs.CL]
Spoorthy, V., Thenkanidiyoor, V., Dinesh, D.A.: SVM Based Language Diarization for Code-Switched Bilingual Indian Speech Using Bottleneck Features. In: Proceedings of the 6th International Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 132–136 (2018). https://doi.org/10.21437/SLTU.2018-28
Yilmaz, E., McLaren, M., van den Heuvel, H., van Leeuwen, D.A.: Language diarization for semi-supervised bilingual acoustic model training. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 91–96. IEEE (2017)
Acknowledgments
The authors would like to acknowledge “Anatganak”, the high performance computation (HPC) facility at IIT Dharwad, for enabling us to perform our experiments, and the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for supporting us through the “Bhashini: Speech technologies in Indian languages” project.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Mishra, J., Prasanna, S.R.M. (2022). Importance of Supra-Segmental Information and Self-Supervised Framework for Spoken Language Diarization Task. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_42
DOI: https://doi.org/10.1007/978-3-031-20980-2_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science, Computer Science (R0)