
Effect of background Indian music on performance of speech recognition models for Hindi databases

Published in International Journal of Speech Technology

Abstract

Multimedia content analysis has attracted great interest over the past few decades. One task that has received particular attention from researchers is automatic speech recognition (ASR) of speech data from broadcast radio and TV programmes. However, the presence of background music in such data heavily degrades the performance of ASR models. In this paper, we first study the temporal and spectral properties of music samples recorded from five different Indian instruments. Then, to assess the effect of background Indian music on the recognition accuracy of ASR models for Hindi databases, models were trained on both isolated and continuous speech databases, using both clean and noisy data, giving a total of four scenarios: (1) clean isolated database, (2) noisy isolated database, (3) clean continuous database, and (4) noisy continuous database. The variation of ASR performance was observed across SNR levels of the background music from 0 to 30 dB. The background music was mixed with clean speech signals both independently, where the sound of a single instrument was used, and in combination, where the sounds of several instruments were mixed together. Overall, the greatest degradation in ASR performance is observed for background noise generated from audio samples of the Been, with average WERs of 13.37 and 72.21 for the isolated and continuous text models respectively, whereas the least degradation is observed for background noise generated from audio samples of the Harmonium and the Flute, with WERs of 15.25 and 66.09 for the isolated and continuous text models respectively. We further correlated the observed ASR performance with the temporal and spectral properties of the music signals and found that higher values of zero-crossing rate, roll-off rate, spectral centroid and spectral flux indicated greater degradation in ASR performance. These features were therefore found to give more important cues for characterizing the background noise than features such as spectral entropy and short-term energy. The work presented in this paper will be useful for a better understanding of music compensation algorithms focused on the Indian market.
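The noisy databases were built by adding instrument recordings to clean speech at controlled signal-to-noise ratios (0–30 dB). The paper does not reproduce its mixing procedure, so the following is a minimal sketch of the standard power-based scaling in Python with NumPy; all names are ours, and the random signals are stand-ins for real recordings.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, music: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `music` to `speech` so the speech-to-music power ratio is `snr_db`.

    Both arrays are float waveforms at the same sample rate.
    """
    # Loop the music track if it is shorter than the utterance, then truncate.
    if len(music) < len(speech):
        music = np.tile(music, int(np.ceil(len(speech) / len(music))))
    music = music[: len(speech)]

    speech_power = np.mean(speech ** 2)
    music_power = np.mean(music ** 2)
    # From SNR_dB = 10 * log10(P_speech / (g^2 * P_music)), solve for the gain g.
    gain = np.sqrt(speech_power / (music_power * 10 ** (snr_db / 10)))
    return speech + gain * music

# Usage: corrupt a clean signal at the SNR levels studied in the paper.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of clean speech at 16 kHz
music = rng.standard_normal(8000)    # stand-in for an instrument recording
noisy_versions = {snr: mix_at_snr(speech, music, snr) for snr in range(0, 31, 5)}
```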

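The correlation analysis rests on standard short-time feature definitions. The sketch below computes the four cues found most predictive of degradation, again in NumPy; the frame length, hop size, and 0.85 roll-off fraction are our assumptions, not settings taken from the paper.

```python
import numpy as np

def frames(x: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Slice x (assumed longer than frame_len) into overlapping frames."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def spectral_cues(x: np.ndarray, sr: int = 16000, frame_len: int = 512,
                  hop: int = 256, roll_pct: float = 0.85) -> dict:
    f = frames(x, frame_len, hop)
    # Zero-crossing rate: fraction of sign changes per frame, on the raw waveform.
    zcr = np.abs(np.diff(np.signbit(f).astype(np.int8), axis=1)).mean(axis=1)
    # Magnitude spectra of Hann-windowed frames.
    mag = np.abs(np.fft.rfft(f * np.hanning(frame_len), axis=1))
    freqs = np.fft.rfftfreq(frame_len, 1.0 / sr)
    # Spectral centroid: magnitude-weighted mean frequency.
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    # Roll-off: lowest frequency below which roll_pct of the magnitude lies.
    cum = np.cumsum(mag, axis=1)
    rolloff = freqs[np.argmax(cum >= roll_pct * cum[:, -1:], axis=1)]
    # Flux: frame-to-frame change of the normalized magnitude spectrum.
    norm = mag / (mag.sum(axis=1, keepdims=True) + 1e-12)
    flux = np.sqrt((np.diff(norm, axis=0) ** 2).sum(axis=1))
    return {"zcr": zcr.mean(), "centroid": centroid.mean(),
            "rolloff": rolloff.mean(), "flux": flux.mean()}

# Usage: a broadband signal scores higher on all four cues than a steady tone,
# consistent with the paper's finding that such sounds degrade ASR more.
rng = np.random.default_rng(1)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(spectral_cues(tone))
print(spectral_cues(rng.standard_normal(16000)))
```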


Funding

Funding was provided by BIT Mesra.

Author information

Corresponding author

Correspondence to Arvind Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kumar, A., Solanki, S.S. & Chandra, M. Effect of background Indian music on performance of speech recognition models for Hindi databases. Int J Speech Technol 26, 1153–1164 (2023). https://doi.org/10.1007/s10772-021-09948-3

DOI: https://doi.org/10.1007/s10772-021-09948-3
