Importance of Supra-Segmental Information and Self-Supervised Framework for Spoken Language Diarization Task

Mishra, Jagabandhu; Prasanna, S. R. Mahadeva

doi:10.1007/978-3-031-20980-2_42

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Spoken language diarization (LD) is the task of automatically extracting the monolingual segments present in a given code-switched utterance. In a bilingual code-switched scenario where a single speaker speaks both languages, the phoneme production of the secondary language is generally biased towards that of the primary language, leading to acoustic similarity between the two. The turn duration of the primary language is also significantly longer than that of the secondary language, leading to data imbalance. Because of this acoustic similarity and data imbalance, the performance of existing approaches is biased toward the primary language. The influence of acoustic similarity can be minimized by capturing supra-segmental language-specific information. Similarly, the influence of data imbalance can be suppressed by capturing the language-specific information beforehand through a pre-training framework. Therefore, this work proposes a wav2vec2-based self-supervised pre-training framework to capture supra-segmental language-specific information. The results show that the proposed framework provides a relative improvement of 33.4% in terms of Jaccard error rate (JER) over the available deep-speech2-based baseline approach. The improvement in JER suggests that the proposed approach resolves the performance bias issue to some extent.
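To make the evaluation metric concrete, the following is a minimal sketch of how a Jaccard error rate and a relative improvement could be computed from frame-level language labels. It assumes the common definition of JER (one minus the Jaccard index between reference and hypothesis frames, averaged over the reference languages); the abstract does not spell out the exact scoring protocol, so this may differ from the authors' setup. The function names, the labels "L1" and "L2", and all numbers below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def jaccard_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return JER in percent: mean over reference languages of 1 - |ref & hyp| / |ref | hyp|."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    languages = sorted(set(reference))
    if not languages:
        raise ValueError("reference is empty")
    errors = []
    for lang in languages:
        # Frame indices assigned to this language by the reference and by the system.
        ref_frames = {i for i, label in enumerate(reference) if label == lang}
        hyp_frames = {i for i, label in enumerate(hypothesis) if label == lang}
        union = ref_frames | hyp_frames  # never empty: lang occurs in the reference
        intersection = ref_frames & hyp_frames
        errors.append(1.0 - len(intersection) / len(union))
    return 100.0 * sum(errors) / len(errors)


def relative_improvement(baseline: float, proposed: float) -> float:
    """Return the relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical utterance: 8 frames of the primary language, 2 of the secondary.
reference = ["L1"] * 8 + ["L2"] * 2
# A system biased toward the primary language labels every frame as L1.
hypothesis = ["L1"] * 10

print(jaccard_error_rate(reference, hypothesis))
# Hypothetical error rates, not results from the paper.
print(relative_improvement(baseline=50.0, proposed=40.0))
```

In this toy case the biased system labels 80% of the frames correctly, yet its JER is high because the secondary language is missed entirely. This is why a per-language averaged metric such as JER is sensitive to the performance bias described above.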

Acknowledgments

The authors would like to acknowledge “Anatganak”, the high performance computation (HPC) facility at IIT Dharwad, for enabling us to perform our experiments, and the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for supporting us through the “Bhashini: Speech technologies in Indian languages” project.

Author information

Authors and Affiliations

Department of Electrical Engineering, Indian Institute of Technology (IIT) Dharwad, Dharwad, 580011, India
Jagabandhu Mishra & S. R. Mahadeva Prasanna

Corresponding author

Correspondence to Jagabandhu Mishra.

Editor information

Editors and Affiliations

Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna
St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal

Cite this paper

Mishra, J., Prasanna, S.R.M. (2022). Importance of Supra-Segmental Information and Self-Supervised Framework for Spoken Language Diarization Task. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_42

DOI: https://doi.org/10.1007/978-3-031-20980-2_42
Published: 10 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science, Computer Science (R0)
