Emotion Recognition in Videos via Fusing Multimodal Features

  • Conference paper
  • In: Pattern Recognition (CCPR 2016)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 663)


Abstract

Emotion recognition is a challenging task with a wide range of applications. In this paper, we present our system for the CCPR 2016 multimodal emotion recognition challenge. Multimodal features are extracted from acoustic signals, facial expressions, and speech content to recognize the emotion of the character in each video; among them, the facial CNN feature is the most discriminative. We train SVM and random forest classifiers on each feature type and combine the different modalities with both early and late fusion. To deal with class imbalance in the data, we propose adapting the probability threshold for each emotion class. Our best multimodal fusion system achieves a macro precision of 50.34% on the test set, significantly outperforming the baseline of 30.63%.
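The paper itself does not include code, but the pipeline the abstract describes (per-modality classifiers, late fusion by averaging class posteriors, and per-class probability thresholds to counter imbalance) can be sketched as below. This is a minimal illustration assuming scikit-learn; the feature matrices, fusion weights, and threshold values are placeholder assumptions, not the authors' actual configuration.

```python
# Minimal sketch of the fusion pipeline described in the abstract,
# assuming scikit-learn. Feature matrices, fusion weights, and the
# per-class thresholds are illustrative placeholders, NOT the
# authors' actual configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def late_fusion_predict(X_audio_train, X_face_train, y_train,
                        X_audio_test, X_face_test,
                        weights=(0.5, 0.5), thresholds=None):
    # One classifier per modality. scikit-learn sorts class labels the
    # same way for both models, so their predict_proba columns align.
    svm = SVC(probability=True).fit(X_audio_train, y_train)
    rf = RandomForestClassifier(n_estimators=200).fit(X_face_train, y_train)

    # Late fusion: weighted average of the two posterior distributions.
    probs = (weights[0] * svm.predict_proba(X_audio_test)
             + weights[1] * rf.predict_proba(X_face_test))

    if thresholds is not None:
        # Per-class threshold adaptation: scaling each class posterior
        # by 1/threshold lets a rare class win the argmax with a lower
        # raw probability, countering class imbalance.
        probs = probs / np.asarray(thresholds)

    return svm.classes_[probs.argmax(axis=1)]
```

Early fusion, by contrast, would concatenate the per-modality feature vectors (e.g. np.hstack([X_audio_train, X_face_train])) and train a single classifier on the joint representation.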



Acknowledgments

This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202, and partially supported by Tencent Inc.

Author information


Corresponding author

Correspondence to Qin Jin.



Copyright information

© 2016 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chen, S. et al. (2016). Emotion Recognition in Videos via Fusing Multimodal Features. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_52


  • DOI: https://doi.org/10.1007/978-981-10-3005-5_52

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3004-8

  • Online ISBN: 978-981-10-3005-5

  • eBook Packages: Computer Science, Computer Science (R0)
