Abstract
Speech contains rich yet entangled information, ranging from phonetic to emotional components. These components are always mixed together, which hinders certain tasks from achieving better performance; automatically learning a good representation that disentangles them is therefore desirable but non-trivial. In this paper, we propose a hierarchical method that uses deep neural networks (DNNs) to extract utterance-level features from frame-level acoustic features. Moreover, inspired by recent progress in face recognition, we introduce centre loss as a supervision signal complementary to the traditional softmax loss, encouraging intra-class compactness of the learned features. With the joint supervision of these two loss functions, the DNNs can be trained to produce emotion-specific features that are both separable and discriminative. Experiments on the CASIA corpus, the Emo-DB corpus and the SAVEE database show results comparable to those of state-of-the-art approaches.
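The abstract's central idea is joint supervision: a softmax (cross-entropy) loss keeps the learned features separable across emotion classes, while the centre loss L_C = (1/2) Σ_i ||x_i − c_{y_i}||², where c_{y_i} is the centre of sample i's class, pulls same-class features together. Since the full text is not reproduced here, the following Python (PyTorch) sketch is only an illustration of that training objective under assumed choices: the toy EmotionNet, the mean pooling over frames, λ = 0.01 and all layer sizes are hypothetical, not the authors' configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Centre loss (Wen et al., ECCV 2016): mean squared distance between
    each utterance-level feature and the learnable centre of its class."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # L_C = 1/2 * mean_i ||x_i - c_{y_i}||^2
        return 0.5 * (features - self.centers[labels]).pow(2).sum(dim=1).mean()

class EmotionNet(nn.Module):
    """Hypothetical stand-in for the paper's hierarchical model: a frame-level
    DNN followed by mean pooling over time, giving one feature per utterance."""
    def __init__(self, frame_dim=39, feat_dim=64, num_classes=6):
        super().__init__()
        self.frame_dnn = nn.Sequential(
            nn.Linear(frame_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames):                      # frames: (batch, time, frame_dim)
        feats = self.frame_dnn(frames).mean(dim=1)  # pool frames -> utterance feature
        return self.classifier(feats), feats

def joint_loss(logits, feats, labels, center_loss, lam=0.01):
    # Joint supervision: L = L_softmax + lambda * L_centre
    return F.cross_entropy(logits, labels) + lam * center_loss(feats, labels)

With λ = 0 this reduces to ordinary softmax training; increasing λ trades inter-class separability for intra-class compactness. In Wen et al.'s formulation the centres are updated with their own learning rate so that a few outlying samples do not drag them around; letting the main optimiser update self.centers directly, as above, is a common simplification.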