Abstract
The performance of speech emotion recognition (SER) systems degrades when emotional speech is spoken in different accents. One possible solution is to identify the accent beforehand and use this knowledge in the SER task. The present work is a novel attempt in this regard to build effective accent recognition systems from emotional speech. Statistical aggregation functions (mean, standard deviation, kurtosis, etc.) are applied to frame-level feature representations such as perceptual linear prediction (PLP), log filterbank energies (LFBE), Mel frequency cepstral coefficients (MFCC), spectral subband centroids (SSC), constant-Q cepstral coefficients (CQCC), chroma vectors and Mel frequency discrete wavelet coefficients (MFDWC) to obtain utterance-level features from CREMA-D, an emotional speech dataset. The performance of these features with several standard classifiers is evaluated on both clean and noisy speech signals. The experimental results show that the SSC features perform well on noisy data only when the classifier is trained on noisy data, so SSC can only be called conditionally robust. The combined MFDWC features, on the other hand, perform well on noisy data with either clean or noisy training data, which hints at the noise-robustness of this feature set. We hope this work will initiate a new line of research in emotion recognition.
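The utterance-level feature construction described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: it assumes frame-level features (e.g., 13 MFCCs per frame) have already been extracted, and uses random data as a stand-in for a real feature matrix.

```python
import numpy as np

def utterance_features(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, feat_dim) matrix of frame-level features
    (e.g., MFCC, PLP, LFBE) into one utterance-level vector by applying
    statistical functionals along the time axis."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)
    z = (frames - mu) / sigma
    skewness = (z ** 3).mean(axis=0)
    kurt = (z ** 4).mean(axis=0) - 3.0  # excess (Fisher) kurtosis
    return np.concatenate([mu, sigma, skewness, kurt])

# Stand-in for 200 frames of a 13-dimensional feature (e.g., 13 MFCCs)
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))
vec = utterance_features(frames)
print(vec.shape)  # (52,) -> 4 functionals x 13 coefficients
```

The resulting fixed-length vector can then be fed to any standard classifier (SVM, random forest, etc.) regardless of the utterance's duration, which is the point of the aggregation step.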
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
G, P.D., Rao, K.S. Accent classification from an emotional speech in clean and noisy environments. Multimed Tools Appl 82, 3485–3508 (2023). https://doi.org/10.1007/s11042-022-13236-w