Abstract
For the past two decades, emotion recognition has attracted great attention because of its huge potential in many applications. Most work in this field attempts to recognize emotion from a single modality, such as image or speech. Recently, some studies have investigated emotion recognition from multiple modalities, i.e., speech and image. The information fusion strategy is a key point in multi-modal emotion recognition; fusion strategies fall into two main categories: feature-level fusion and decision-level fusion. This paper explores emotion recognition from multiple modalities, i.e., speech and image. We make a systematic and detailed comparison among several feature-level and decision-level fusion methods, such as PCA-based feature fusion, LDA-based feature fusion, product-rule-based decision fusion, and mean-rule-based decision fusion. We evaluate all the compared methods on the Surrey Audio-Visual Expressed Emotion (SAVEE) database. The experimental results demonstrate that emotion recognition based on the fusion of speech and image achieves higher recognition accuracy than emotion recognition from a single modality, and that the decision-level fusion methods outperform the feature-level fusion methods in this work.
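To illustrate the decision-level fusion rules the abstract mentions, the sketch below combines per-class posterior probabilities from a speech classifier and an image classifier using the product rule and the mean rule. This is a minimal illustration, not the authors' implementation; the function name, the example class labels, and the hypothetical probability values are assumptions for the sake of the example.

```python
import numpy as np

def decision_fusion(speech_probs, image_probs, rule="product"):
    """Fuse per-class posteriors from two unimodal classifiers at the
    decision level. Both inputs are 1-D arrays over the same emotion
    classes; the fused scores are renormalized into a distribution."""
    speech_probs = np.asarray(speech_probs, dtype=float)
    image_probs = np.asarray(image_probs, dtype=float)
    if rule == "product":
        fused = speech_probs * image_probs        # product rule
    elif rule == "mean":
        fused = (speech_probs + image_probs) / 2  # mean rule
    else:
        raise ValueError(f"unknown rule: {rule}")
    return fused / fused.sum()

# Hypothetical posteriors over three classes [anger, happiness, sadness]
speech = [0.5, 0.3, 0.2]
image = [0.2, 0.6, 0.2]
print(decision_fusion(speech, image, "product"))
print(decision_fusion(speech, image, "mean"))
```

Decision-level fusion of this kind only needs each unimodal classifier to output class probabilities, whereas feature-level fusion (e.g., PCA- or LDA-based) concatenates or projects the raw feature vectors before a single classifier is trained.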
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61402219) and by Postdoctoral Foundation projects (Nos. LBH-Z14090, 2015M571417, and 2017T100243).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Li, Y., He, Q., Zhao, Y., Yao, H. (2018). Multi-modal Emotion Recognition Based on Speech and Image. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. Lecture Notes in Computer Science, vol 10735. Springer, Cham. https://doi.org/10.1007/978-3-319-77380-3_81
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77379-7
Online ISBN: 978-3-319-77380-3