Abstract
The performance of speech emotion recognition (SER) systems degrades when emotional speech is spoken in different accents. One possible solution is to identify the accent beforehand and use this knowledge in the SER task. The present work is a novel attempt in this regard to build effective accent recognition systems from emotional speech. Statistical aggregation functions (mean, standard deviation, kurtosis, etc.) are applied to frame-level feature representations such as perceptual linear prediction (PLP), log filterbank energies (LFBE), Mel frequency cepstral coefficients (MFCC), spectral subband centroids (SSC), constant-Q cepstral coefficients (CQCC), chroma vectors and Mel frequency discrete wavelet coefficients (MFDWC) to obtain utterance-level features from CREMA-D, an emotional speech dataset. The performance of these features with several standard classifiers is evaluated on both clean and noisy speech signals. The experimental results show that the SSC features perform well on noisy data only when the classifier is trained on noisy data, so SSC can only be called conditionally robust. The combined MFDWC features, on the other hand, perform well on noisy data with either clean or noisy training data, which hints at the noise-robustness of this feature set. We hope this work will initiate a new line of research in emotion recognition.
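The utterance-level feature construction described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: it assumes frame-level features (e.g., 13 MFCCs per frame) have already been extracted, and uses random data as a stand-in for a real feature matrix.

```python
import numpy as np

def utterance_features(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, feat_dim) matrix of frame-level features
    (e.g., MFCC, PLP, LFBE) into one utterance-level vector by applying
    statistical functionals along the time axis."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)
    z = (frames - mu) / sigma
    skewness = (z ** 3).mean(axis=0)
    kurt = (z ** 4).mean(axis=0) - 3.0  # excess (Fisher) kurtosis
    return np.concatenate([mu, sigma, skewness, kurt])

# Stand-in for 200 frames of a 13-dimensional feature (e.g., 13 MFCCs)
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))
vec = utterance_features(frames)
print(vec.shape)  # (52,) -> 4 functionals x 13 coefficients
```

The resulting fixed-length vector can then be fed to any standard classifier (SVM, random forest, etc.) regardless of the utterance's duration, which is the point of the aggregation step.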
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
G, P.D., Rao, K.S. Accent classification from an emotional speech in clean and noisy environments. Multimed Tools Appl 82, 3485–3508 (2023). https://doi.org/10.1007/s11042-022-13236-w