Abstract
For the past two decades, emotion recognition has attracted great attention because of its huge potential in many applications. Most work in this field attempts to recognize emotion from a single modality, such as image or speech. Recently, some studies have investigated emotion recognition from multiple modalities, i.e., speech and image. The information fusion strategy is a key point in multi-modal emotion recognition; fusion strategies fall into two main categories: feature-level fusion and decision-level fusion. This paper explores emotion recognition from multiple modalities, i.e., speech and image. We make a systematic and detailed comparison among several feature-level and decision-level fusion methods, such as PCA-based feature fusion, LDA-based feature fusion, product-rule-based decision fusion, and mean-rule-based decision fusion. We evaluate all the compared methods on the Surrey Audio-Visual Expressed Emotion (SAVEE) database. The experimental results demonstrate that emotion recognition based on the fusion of speech and image achieves higher recognition accuracy than emotion recognition from a single modality, and that the decision-level fusion methods outperform the feature-level fusion methods in this work.
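To illustrate the decision-level fusion rules the abstract mentions, the sketch below combines per-class posterior probabilities from a speech classifier and an image classifier using the product rule and the mean rule. This is a minimal illustration, not the authors' implementation; the function name, the example class labels, and the hypothetical probability values are assumptions for the sake of the example.

```python
import numpy as np

def decision_fusion(speech_probs, image_probs, rule="product"):
    """Fuse per-class posteriors from two unimodal classifiers at the
    decision level. Both inputs are 1-D arrays over the same emotion
    classes; the fused scores are renormalized into a distribution."""
    speech_probs = np.asarray(speech_probs, dtype=float)
    image_probs = np.asarray(image_probs, dtype=float)
    if rule == "product":
        fused = speech_probs * image_probs        # product rule
    elif rule == "mean":
        fused = (speech_probs + image_probs) / 2  # mean rule
    else:
        raise ValueError(f"unknown rule: {rule}")
    return fused / fused.sum()

# Hypothetical posteriors over three classes [anger, happiness, sadness]
speech = [0.5, 0.3, 0.2]
image = [0.2, 0.6, 0.2]
print(decision_fusion(speech, image, "product"))
print(decision_fusion(speech, image, "mean"))
```

Decision-level fusion of this kind only needs each unimodal classifier to output class probabilities, whereas feature-level fusion (e.g., PCA- or LDA-based) concatenates or projects the raw feature vectors before a single classifier is trained.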
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61402219) and by Postdoctoral Foundation projects (Nos. LBH-Z14090, 2015M571417, and 2017T100243).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Li, Y., He, Q., Zhao, Y., Yao, H. (2018). Multi-modal Emotion Recognition Based on Speech and Image. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. Lecture Notes in Computer Science, vol 10735. Springer, Cham. https://doi.org/10.1007/978-3-319-77380-3_81
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77379-7
Online ISBN: 978-3-319-77380-3