
Multimodal Fusion of Spatial-Temporal Features for Emotion Recognition in the Wild

  • Conference paper
Advances in Multimedia Information Processing – PCM 2017 (PCM 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10735)


Abstract

Making machines understand human emotion is a key challenge in realizing artificial intelligence. Considering the temporal correlation that widely exists in video, we present a system that fuses multimodal spatial-temporal features to recognize emotion. For the visual modality, spatial-temporal features are extracted to represent the dynamic variation of emotion along with facial action in the video. The audio modality is used to assist the visual modality. A decision-level fusion approach is presented to make full use of the complementarity between the visual and audio modalities and to boost the performance of the emotion recognition system. Experiments on the challenging AFEW 4.0 dataset show that the proposed system achieves better generalization performance than other state-of-the-art methods.
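The decision-level fusion described above can be sketched as a weighted average of the per-class probability scores produced independently by the visual and audio classifiers. This is a minimal illustration, not the paper's exact method: the function name, the weight value `alpha`, and the example scores are all hypothetical; the paper's actual fusion weights and classifier outputs are not given here.

```python
import numpy as np

def decision_level_fusion(p_visual, p_audio, alpha=0.6):
    """Fuse per-class probability scores from two modalities.

    alpha weights the visual modality; (1 - alpha) weights the audio
    modality. Returns the fused class index and the fused score vector.
    """
    p_visual = np.asarray(p_visual, dtype=float)
    p_audio = np.asarray(p_audio, dtype=float)
    fused = alpha * p_visual + (1.0 - alpha) * p_audio
    return int(np.argmax(fused)), fused

# Seven AFEW emotion classes (illustrative order):
# angry, disgust, fear, happy, neutral, sad, surprise
p_vis = [0.10, 0.05, 0.05, 0.50, 0.15, 0.10, 0.05]  # visual model scores
p_aud = [0.05, 0.05, 0.10, 0.30, 0.35, 0.10, 0.05]  # audio model scores
label, fused = decision_level_fusion(p_vis, p_aud)
```

Because each modality's scores are combined only at the decision stage, the two classifiers can be trained and tuned independently, which is what makes the audio modality easy to add as an assist to the visual one.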


Notes

  1. http://ffmpeg.org

  2. https://github.com/ShiqiYu/libfacedetection


Acknowledgements

The work is funded by the National Natural Science Foundation of China (No. 61371149, No. 61170155), Shanghai Innovation Action Plan Project (No. 16511101200) and the Open Project Program of the National Laboratory of Pattern Recognition (No. 201600017).

Author information


Corresponding author

Correspondence to Yuchun Fang.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Wang, Z., Fang, Y. (2018). Multimodal Fusion of Spatial-Temporal Features for Emotion Recognition in the Wild. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. PCM 2017. Lecture Notes in Computer Science, vol 10735. Springer, Cham. https://doi.org/10.1007/978-3-319-77380-3_20


  • DOI: https://doi.org/10.1007/978-3-319-77380-3_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77379-7

  • Online ISBN: 978-3-319-77380-3

  • eBook Packages: Computer Science (R0)
