
Enhanced speech emotion detection using deep neural networks

Published in: International Journal of Speech Technology

Abstract

This paper investigates the performance of perceptually based speech features for emotion detection. The perceptual features considered are Mel-frequency cepstral coefficients (MFCCs), perceptual linear predictive cepstrum (PLPC), Mel-frequency perceptual linear prediction cepstrum (MFPLPC), Bark-frequency cepstral coefficients (BFCC), revised perceptual linear prediction coefficients (RPLP) and inverted Mel-frequency cepstral coefficients (IMFCC). An algorithm using these auditory cues is evaluated with deep neural networks (DNNs). The novelty of the work lies in analysing the perceptual features to identify the predominant ones that carry significant emotional information about the speaker. The validity of the algorithm is analysed on the publicly available Berlin database, both for seven emotions in a one-dimensional categorical space and in a two-dimensional continuous space spanning the valence and arousal dimensions. Comparative analysis reveals that a considerable improvement in emotion recognition performance is obtained using a DNN with the identified combination of perceptual features.
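To make the pipeline described in the abstract concrete, the sketch below extracts one of the listed perceptual features (MFCCs) from an utterance and trains a small fully connected DNN over seven categorical emotion classes. This is a minimal illustrative sketch, not the authors' implementation: the librosa-based feature statistics, the network sizes, the 16 kHz sampling rate, and the wav_paths/y_labels variables are all assumptions.

    import numpy as np
    import librosa                      # assumed feature-extraction library
    from tensorflow import keras

    # Seven categorical emotions of the Berlin (EMO-DB) corpus.
    EMOTIONS = ["anger", "boredom", "disgust", "fear",
                "happiness", "neutral", "sadness"]

    def mfcc_features(wav_path, n_mfcc=13):
        """Collapse frame-level MFCCs into a fixed-length utterance vector
        (mean and standard deviation over time)."""
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def build_dnn(input_dim, n_classes=len(EMOTIONS)):
        """Small fully connected DNN classifier over emotion classes."""
        model = keras.Sequential([
            keras.layers.Input(shape=(input_dim,)),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dropout(0.3),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Hypothetical usage: wav_paths is a list of EMO-DB files, y_labels
    # the matching integer indices into EMOTIONS.
    # X = np.stack([mfcc_features(p) for p in wav_paths])
    # model = build_dnn(X.shape[1])
    # model.fit(X, y_labels, epochs=50, validation_split=0.2)

The other features named in the abstract (PLPC, MFPLPC, BFCC, RPLP, IMFCC) would slot in by swapping out mfcc_features, and a combination of features would simply concatenate the selected per-utterance vectors before training.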



Author information

Correspondence to S. Lalitha.


Cite this article

Lalitha, S., Tripathi, S. & Gupta, D. Enhanced speech emotion detection using deep neural networks. Int J Speech Technol 22, 497–510 (2019). https://doi.org/10.1007/s10772-018-09572-8
