
Accent classification from an emotional speech in clean and noisy environments

Multimedia Tools and Applications

Abstract

The performance of speech emotion recognition (SER) systems suffers when emotional speech is spoken in different accents. One possible solution to this problem is to identify the accent beforehand and use this knowledge in the SER task. The present work is a novel attempt in this direction: building effective accent recognition systems from emotional speech. Statistical aggregation functions (mean, standard deviation, kurtosis, etc.) are applied to frame-level feature representations such as perceptual linear prediction (PLP), log filterbank energies (LFBE), Mel frequency cepstral coefficients (MFCC), spectral subband centroids (SSC), constant-Q cepstral coefficients (CQCC), chroma vectors and Mel frequency discrete wavelet coefficients (MFDWC) to obtain utterance-level features from CREMA-D, an emotional speech dataset. The performance of these features with different standard classifiers is evaluated in experiments on clean and noisy speech signals. The experimental results show that the SSC features perform well on noisy data only when the classifier is trained on noisy data, whereas the combined MFDWC features perform well on noisy data under both clean and noisy training conditions. This suggests that the MFDWC feature set is noise-robust, while SSC is only conditionally robust. We hope this work will initiate a new line of research in emotion recognition.
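To make the feature pipeline concrete, the sketch below aggregates frame-level MFCCs into a fixed-length utterance-level vector with the statistical functions named above, mixes background noise at a target SNR for the noisy condition, and trains a standard classifier. It is a minimal illustration assuming librosa and scikit-learn; the exact feature configurations, noise recordings, SNR levels, and classifier hyperparameters used in the paper are not reproduced here, and all function names and parameter values are illustrative.

import numpy as np
import librosa
from scipy.stats import kurtosis
from sklearn.svm import SVC

def utterance_features(wav_path, sr=16000, n_mfcc=13):
    """Collapse frame-level MFCCs into one utterance-level vector
    via statistical aggregation (mean, standard deviation, kurtosis)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return np.concatenate([
        mfcc.mean(axis=1),       # per-coefficient mean over frames
        mfcc.std(axis=1),        # per-coefficient standard deviation
        kurtosis(mfcc, axis=1),  # per-coefficient kurtosis
    ])

def add_noise(y, noise, snr_db):
    """Mix a background-noise signal into clean speech at a target SNR (dB)."""
    noise = np.resize(noise, y.shape)  # repeat/trim noise to match speech length
    p_signal = np.mean(y ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return y + scale * noise

def train_accent_classifier(train_paths, train_labels):
    """Fit a standard classifier (an RBF-kernel SVM here) on the
    aggregated utterance-level features."""
    X = np.stack([utterance_features(p) for p in train_paths])
    clf = SVC(kernel="rbf")
    clf.fit(X, train_labels)
    return clf

The same pattern extends to the other front-ends studied in the paper: replacing the MFCC extractor with PLP, SSC, CQCC, chroma, or MFDWC features, and swapping the SVM for any other standard classifier, changes only the two marked functions.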



Author information


Corresponding author

Correspondence to Priya Dharshini G.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

G, P.D., Rao, K.S. Accent classification from an emotional speech in clean and noisy environments. Multimed Tools Appl 82, 3485–3508 (2023). https://doi.org/10.1007/s11042-022-13236-w

