Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition

Huang, Yongming; Tian, Kexin; Wu, Ao; Zhang, Guobao

doi:10.1007/s12652-017-0644-8

Original Research
Published: 02 December 2017

Volume 10, pages 1787–1798, (2019)

Journal of Ambient Intelligence and Humanized Computing

Yongming Huang^1,2,
Kexin Tian^1,2,
Ao Wu^1,2 &
…
Guobao Zhang^1,2

Abstract

The speech emotion recognition accuracy of prosodic and voice quality features declines as the signal-to-noise ratio (SNR) of the speech signal decreases. In this paper, we propose novel sub-band spectral centroid weighted wavelet packet cepstral coefficients (W-WPCC) for robust speech emotion recognition. The W-WPCC feature is computed by combining sub-band energies with sub-band spectral centroids via a weighting scheme to generate noise-robust acoustic features. Deep belief networks (DBNs) are artificial neural networks with more than one hidden layer, which are first pre-trained layer by layer and then fine-tuned using the back-propagation algorithm. Well-trained deep neural networks are capable of modeling complex, non-linear features of the input training data and can better predict the probability distribution over classification labels. We extracted prosodic features, voice quality features and wavelet packet cepstral coefficients (WPCC) from the speech signals, combined them with W-WPCC, and fused them using DBNs. Experimental results on the Berlin emotional speech database show that the proposed fused feature with W-WPCC is more suitable for speech emotion recognition under noisy conditions than other acoustic features, and that the proposed DBN feature learning structure combined with W-WPCC improves emotion recognition performance over the conventional emotion recognition method.
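
The idea of weighting sub-band energies by sub-band spectral centroids before taking cepstral coefficients can be illustrated with a minimal sketch. It is not the authors' W-WPCC implementation: the abstract does not give the weighting formula, so the `(1 + centroid)` weight is an assumption, and uniform DFT sub-bands stand in for the wavelet packet decomposition. The function names `power_spectrum` and `centroid_weighted_cepstrum` and all parameter defaults are hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def centroid_weighted_cepstrum(frame, num_bands=8, num_coeffs=6, eps=1e-12):
    """Centroid-weighted sub-band cepstral features (illustrative only).

    Assumptions: uniform DFT sub-bands replace the wavelet packet tree,
    and each band energy is scaled by (1 + normalised centroid).
    """
    power = power_spectrum(frame)
    band_size = len(power) // num_bands
    weighted_log_energies = []
    for b in range(num_bands):
        lo = b * band_size
        # The last band absorbs any leftover bins.
        hi = lo + band_size if b < num_bands - 1 else len(power)
        band = power[lo:hi]
        energy = sum(band)
        # Spectral centroid inside the band, normalised to [0, 1].
        centroid = sum(i * p for i, p in enumerate(band)) / (energy + eps)
        centroid /= max(len(band) - 1, 1)
        # Assumed weighting scheme: scale the energy, then compress with log.
        weighted_log_energies.append(math.log((1.0 + centroid) * energy + eps))
    # DCT-II of the weighted log energies gives the cepstral coefficients.
    return [
        sum(v * math.cos(math.pi * k * (m + 0.5) / num_bands)
            for m, v in enumerate(weighted_log_energies))
        for k in range(num_coeffs)
    ]


# Example: one 64-sample frame containing a single low-frequency tone.
frame = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
features = centroid_weighted_cepstrum(frame)
```

In a fuller pipeline, per-frame vectors like `features` would be computed from wavelet packet sub-bands and concatenated with the prosodic and voice quality features before being passed to the DBN.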

Acknowledgements

An earlier version of this paper was presented at the International Conference on Network, Communication and Computing (ICNCC 2016). This work was supported by the Open Fund of the Jiangsu Province Natural Science Foundation (No. BK20140649) and the National Natural Science Foundation (No. 61503081, No. 61473079).

Author information

Authors and Affiliations

Laboratory of Measurement and Control of Complex Systems of Engineering, Southeast University, Nanjing, China
Yongming Huang, Kexin Tian, Ao Wu & Guobao Zhang
Ministry of Education, School of Automation, Southeast University, Nanjing, 210096, China
Yongming Huang, Kexin Tian, Ao Wu & Guobao Zhang

Corresponding author

Correspondence to Yongming Huang.

About this article

Cite this article

Huang, Y., Tian, K., Wu, A. et al. Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition. J Ambient Intell Human Comput 10, 1787–1798 (2019). https://doi.org/10.1007/s12652-017-0644-8

Received: 19 September 2017
Accepted: 27 November 2017
Published: 02 December 2017
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s12652-017-0644-8
