Abstract
In this paper, we address the CCPR 2016 Multimodal Emotion Recognition Challenge (MEC 2016), which is based on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) of movie and TV-program clips showing (nearly) spontaneous human emotions. Low-level descriptors (LLDs) are used as audio features. As visual features, we propose the histogram of oriented gradients (HOG), local phase quantisation (LPQ), shape features, and behavior-related features such as head pose and eye gaze. The visual features are post-processed to delete or smooth all-zero feature-vector segments. Single-modal emotion recognition is performed using fully connected hidden Markov models (HMMs). For multimodal emotion recognition, two schemes are proposed: in the first, the normalized probability vectors output by the HMMs are fed to a support vector machine (SVM) for final recognition; in the second, the final emotion is estimated from either the audio or the video features, depending on whether a face has been detected throughout the video. Moreover, to make full use of the labeled data and to alleviate class imbalance, we train the HMMs and SVMs on the training and validation sets together, with parameters optimized via cross-validation. Experimental results on the test set show that the macro average precisions (MAPs) of audio, visual, and multimodal emotion recognition reach \(42.85\,\%\), \(54.24\,\%\), and \(53.90\,\%\), respectively, considerably higher than the corresponding baseline results of \(24.02\,\%\), \(34.28\,\%\), and \(30.63\,\%\).
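The first fusion scheme and the reported metric can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names (`normalized_prob_vector`, `macro_average_precision`) are hypothetical, and the use of a softmax to turn per-class HMM log-likelihoods into a probability vector is an assumption about how the normalization might be done.

```python
import math


def normalized_prob_vector(log_likelihoods):
    """Softmax over per-class HMM log-likelihoods -> normalized probability vector."""
    m = max(log_likelihoods)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in log_likelihoods]
    s = sum(exps)
    return [e / s for e in exps]


def fusion_feature(audio_lls, video_lls):
    """Concatenate normalized audio and video probability vectors as SVM input."""
    return normalized_prob_vector(audio_lls) + normalized_prob_vector(video_lls)


def macro_average_precision(y_true, y_pred, labels):
    """MAP: mean over classes of (true positives / samples predicted as that class)."""
    precisions = []
    for c in labels:
        predicted = sum(1 for p in y_pred if p == c)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == c)
        precisions.append(tp / predicted if predicted else 0.0)
    return sum(precisions) / len(labels)
```

The concatenated vector would then be the feature input to the SVM in the first fusion scheme; MAP averages per-class precision so that frequent classes do not dominate the score, which matters for an unbalanced database such as CHEAVD.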
Acknowledgments
This work is supported by the National Natural Science Foundation of China (grant 61273265), the Research and Development Program of China (863 Program) (No. 2015AA016402), and the VUB Interdisciplinary Research Program through the EMO-App project. We would like to thank team members Xunqin Yin, Meng Zhang and Qian Lei, who helped process the data.
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
Cite this paper
Xia, X., Guo, L., Jiang, D., Pei, E., Yang, L., Sahli, H. (2016). Audio Visual Recognition of Spontaneous Emotions In-the-Wild. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_57
DOI: https://doi.org/10.1007/978-981-10-3005-5_57
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3004-8
Online ISBN: 978-981-10-3005-5