Abstract
Enabling machines to understand human emotion is a key challenge on the path to artificial intelligence. Since temporal correlation is pervasive in video, we present an emotion recognition system based on multimodal fusion of spatial-temporal features. For the visual modality, spatial-temporal features are extracted to represent the dynamic emotional variation that accompanies facial action in the video. The audio modality is used to assist the visual modality. A decision-level fusion approach is presented to make full use of the complementarity between the visual and audio modalities and boost the performance of the emotion recognition system. Experiments on the challenging AFEW 4.0 dataset show that the proposed system achieves better generalization performance than other state-of-the-art methods.
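The abstract's decision-level fusion can be sketched as a weighted combination of per-modality class-probability vectors. This is a minimal illustration only: the specific fusion weights, label set, and label order are assumptions, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical 7-class emotion label set (AFEW-style); the actual label
# order used by the paper is not given in the abstract.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def decision_level_fusion(p_visual, p_audio, w_visual=0.6):
    """Fuse per-modality class probabilities by a weighted average.

    p_visual, p_audio: 1-D arrays of class probabilities, one entry per
    emotion. w_visual is the weight given to the visual modality (the
    audio modality receives 1 - w_visual); the value 0.6 is an assumed
    placeholder, not the paper's tuned weight.
    """
    p_visual = np.asarray(p_visual, dtype=float)
    p_audio = np.asarray(p_audio, dtype=float)
    fused = w_visual * p_visual + (1.0 - w_visual) * p_audio
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: the visual classifier favors "happy", the audio classifier
# favors "neutral"; with a larger visual weight the fused decision
# follows the visual modality.
label, fused = decision_level_fusion(
    [0.05, 0.02, 0.03, 0.55, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.20, 0.50, 0.10, 0.05],
)
```

In practice such weights are typically chosen on a validation set; the point of decision-level (late) fusion is that each modality can use its own classifier, and only their output distributions are combined.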
Acknowledgements
This work was funded by the National Natural Science Foundation of China (No. 61371149, No. 61170155), the Shanghai Innovation Action Plan Project (No. 16511101200), and the Open Project Program of the National Laboratory of Pattern Recognition (No. 201600017).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Wang, Z., Fang, Y. (2018). Multimodal Fusion of Spatial-Temporal Features for Emotion Recognition in the Wild. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. PCM 2017. Lecture Notes in Computer Science(), vol 10735. Springer, Cham. https://doi.org/10.1007/978-3-319-77380-3_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77379-7
Online ISBN: 978-3-319-77380-3