Importance of Supra-Segmental Information and Self-Supervised Framework for Spoken Language Diarization Task

  • Conference paper
Speech and Computer (SPECOM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Abstract

Spoken language diarization (LD) is the task of automatically extracting the monolingual segments present in a given code-switched utterance. In the bilingual code-switched scenario, when a single speaker speaks both languages, the phoneme production of the secondary language is usually biased towards that of the primary language, leading to acoustic similarity between the two. The turns of the primary language are also typically much longer than those of the secondary language, leading to data imbalance. Because of this acoustic similarity and data imbalance, the performance of existing approaches is biased toward the primary language. The influence of acoustic similarity can be minimized by capturing supra-segmental language-specific information. Similarly, the influence of data imbalance can be suppressed by capturing the language-specific information beforehand through a pre-training framework. Therefore, this work proposes a wav2vec2-based self-supervised pre-training framework to capture the supra-segmental language-specific information. The results show that the proposed framework provides a relative improvement of 33.4% in terms of Jaccard error rate (JER) over the available deep-speech2-based baseline. The improvement in JER suggests that the proposed approach resolves the performance bias to some extent.
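
For readers unfamiliar with the metric, the following is a minimal sketch of how a frame-level JER and a relative improvement could be computed. It assumes that the reference and hypothesis are given as equal-length sequences of per-frame language labels, and that JER is taken as one minus the Jaccard index (intersection over union) of each reference language, averaged over languages. The function names, the toy label sequences, and the values 60.0 and 40.0 are hypothetical and are not taken from the paper; the scoring procedure actually used by the authors may differ.

```python
from typing import Sequence


def jaccard_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Frame-level JER in percent, averaged over the languages present in `ref`.

    Assumption: each frame carries exactly one language label, so no
    label mapping between reference and hypothesis is needed.
    """
    if not ref or len(ref) != len(hyp):
        raise ValueError("ref and hyp must be non-empty and of equal length")
    errors = []
    for lang in sorted(set(ref)):
        # Frames where both agree on `lang` (intersection) and frames where
        # either side says `lang` (union).
        inter = sum(1 for r, h in zip(ref, hyp) if r == lang and h == lang)
        union = sum(1 for r, h in zip(ref, hyp) if r == lang or h == lang)
        errors.append(1.0 - inter / union)
    return 100.0 * sum(errors) / len(errors)


def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative reduction of an error metric, in percent of the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# Toy, made-up example: "P" = primary language, "S" = secondary language.
ref = ["P"] * 8 + ["S"] * 2
hyp = ["P"] * 9 + ["S"] * 1  # one secondary frame mislabelled as primary

if __name__ == "__main__":
    print(f"JER: {jaccard_error_rate(ref, hyp):.2f}%")
    print(f"Relative improvement: {relative_improvement(60.0, 40.0):.2f}%")
```

In this toy case only one frame in ten is wrong, yet the secondary language alone contributes a 50% error because its union with the hypothesis is small. Under the stated assumptions, this is why a per-language averaged metric such as JER is more sensitive to a bias toward the primary language than plain frame accuracy.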

Acknowledgments

The authors would like to acknowledge “Anatganak”, the high performance computation (HPC) facility at IIT Dharwad, for enabling our experiments, and the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for its support through the “Bhashini: Speech technologies in Indian languages” project.

Author information

Corresponding author

Correspondence to Jagabandhu Mishra.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Mishra, J., Prasanna, S.R.M. (2022). Importance of Supra-Segmental Information and Self-Supervised Framework for Spoken Language Diarization Task. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_42

  • DOI: https://doi.org/10.1007/978-3-031-20980-2_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20979-6

  • Online ISBN: 978-3-031-20980-2

  • eBook Packages: Computer Science, Computer Science (R0)
