Emotion Recognition in Videos via Fusing Multimodal Features

  • Conference paper
  • In: Pattern Recognition (CCPR 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 663)


Abstract

Emotion recognition is a challenging task with a wide range of applications. In this paper, we present our system for the CCPR 2016 multimodal emotion recognition challenge. Multimodal features are extracted from acoustic signals, facial expressions, and speech content to recognize the emotion of the character in each video; among them, the facial CNN feature is the most discriminative. We train SVM and random forest classifiers on each feature type and combine the different modalities with both early and late fusion. To deal with class imbalance in the data, we propose adapting the probability threshold for each emotion class. Our best multimodal fusion system achieves a macro precision of 50.34% on the test set, significantly outperforming the baseline of 30.63%.
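The paper itself does not include code, but the pipeline the abstract describes (per-modality classifiers, late fusion by averaging class posteriors, and per-class probability thresholds to counter imbalance) can be sketched as below. This is a minimal illustration assuming scikit-learn; the feature matrices, fusion weights, and threshold values are placeholder assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the fusion pipeline described in the abstract,
# assuming scikit-learn. Feature matrices, fusion weights, and the
# per-class thresholds are illustrative placeholders, NOT the
# authors' actual configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def late_fusion_predict(X_audio_train, X_face_train, y_train,
                        X_audio_test, X_face_test,
                        weights=(0.5, 0.5), thresholds=None):
    # One classifier per modality. scikit-learn sorts class labels the
    # same way for both models, so their predict_proba columns align.
    svm = SVC(probability=True).fit(X_audio_train, y_train)
    rf = RandomForestClassifier(n_estimators=200).fit(X_face_train, y_train)

    # Late fusion: weighted average of the two posterior distributions.
    probs = (weights[0] * svm.predict_proba(X_audio_test)
             + weights[1] * rf.predict_proba(X_face_test))

    if thresholds is not None:
        # Per-class threshold adaptation: scaling each class posterior
        # by 1/threshold lets a rare class win the argmax with a lower
        # raw probability, countering class imbalance.
        probs = probs / np.asarray(thresholds)

    return svm.classes_[probs.argmax(axis=1)]
```

Early fusion, by contrast, would concatenate the per-modality feature vectors (e.g. np.hstack([X_audio_train, X_face_train])) and train a single classifier on the joint representation.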



Acknowledgments

This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202, and partially supported by Tencent Inc.

Author information


Corresponding author

Correspondence to Qin Jin.



Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chen, S. et al. (2016). Emotion Recognition in Videos via Fusing Multimodal Features. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_52


  • DOI: https://doi.org/10.1007/978-981-10-3005-5_52

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3004-8

  • Online ISBN: 978-981-10-3005-5

  • eBook Packages: Computer Science, Computer Science (R0)
