Continuous affect recognition with weakly supervised learning

Published in: Multimedia Tools and Applications

Abstract

Recognizing a person’s affective state from audio-visual signals is an essential capability for intelligent interaction. Insufficient training data and unreliable labels of the affective dimensions (e.g., valence and arousal) are two major challenges in continuous affect recognition. In this paper, we propose a weakly supervised learning approach based on a hybrid deep neural network and bidirectional long short-term memory recurrent neural network (DNN-BLSTM). It first maps the audio/visual features into a more discriminative space through the DNN, then models the temporal dynamics of affect through the BLSTM. To reduce the negative impact of the unreliable labels, we employ a temporal label (TL) together with a robust loss function (RL) to incorporate weak supervision into the learning process of the DNN-BLSTM model. As a result, the proposed method not only has a simpler structure than the deep BLSTM model of He et al. (24), which requires more training data, but is also robust to noisy and unreliable labels. Single-modal and multimodal affect recognition experiments were carried out on the RECOLA dataset. The single-modal results show that the proposed method with TL and RL yields notable improvements on both arousal and valence in terms of the concordance correlation coefficient (CCC), while the multimodal results show that, with fewer feature streams, the proposed approach achieves results better than or comparable to state-of-the-art methods.
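
The concordance correlation coefficient (CCC) used as the evaluation measure combines Pearson correlation with penalties for mean and variance differences: CCC = 2·cov(x, y) / (σ_x² + σ_y² + (μ_x − μ_y)²). The following PyTorch sketch, which is not the authors' implementation and uses illustrative layer sizes and names, shows how such a CCC-based objective could be paired with a hybrid DNN-BLSTM regressor of the kind described above.

    import torch
    import torch.nn as nn

    def ccc(pred, gold):
        # Concordance correlation coefficient between two 1-D tensors.
        pred_mean, gold_mean = pred.mean(), gold.mean()
        pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
        cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
        return 2 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)

    class DNNBLSTM(nn.Module):
        # Illustrative layer sizes; the paper's actual configuration may differ.
        def __init__(self, feat_dim, dnn_dim=128, lstm_dim=64):
            super().__init__()
            # DNN front-end: maps frame-level features into a more discriminative space.
            self.dnn = nn.Sequential(
                nn.Linear(feat_dim, dnn_dim), nn.ReLU(),
                nn.Linear(dnn_dim, dnn_dim), nn.ReLU(),
            )
            # BLSTM back-end: models the temporal dynamics of affect.
            self.blstm = nn.LSTM(dnn_dim, lstm_dim, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * lstm_dim, 1)  # per-frame valence or arousal

        def forward(self, x):  # x: (batch, time, feat_dim)
            h, _ = self.blstm(self.dnn(x))
            return self.out(h).squeeze(-1)  # (batch, time)

    # Training would typically minimise 1 - ccc(prediction, label) per sequence.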

References

  1. Baltrušaitis T, Banda N, Robinson P (2013) Dimensional affect recognition using continuous conditional random fields. In: Proceedings of the 10th IEEE international conference and workshops on automatic face and gesture recognition (FG 2013). IEEE, pp 1–8

  2. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, London

  3. Brady K, Gwon Y, Khorrami P, Godoy E, Campbell W, Dagli C, Huang TS (2016) Multi-modal audio, video and physiological sensor learning for continuous emotion prediction. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 97–104

  4. Chao L, Tao J, Yang M, Li Y, Wen Z (2014) Multi-scale temporal modeling for dimensional emotion recognition in video. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 11–18

  5. Chao L, Tao J, Yang M, Li Y, Wen Z (2015) Long short term memory recurrent neural network based multimodal dimensional emotion recognition. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 65–72

  6. Chen S, Jin Q (2015) Multi-modal dimensional emotion recognition using recurrent neural networks. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 49–56

  7. Chen S, Jin Q, Zhao J, Wang S (2017) Multimodal multi-task learning for dimensional and continuous emotion recognition. In: Proceedings of the 7th annual workshop on audio/visual emotion challenge. ACM, pp 19–26

  8. Dhall A, Goecke R, Joshi J, Wagner M, Gedeon T (2013) Emotion recognition in the wild challenge 2013. In: Proceedings of the 15th ACM on International conference on multimodal interaction. ACM, pp 509–516

  9. Dhall A, Goecke R, Joshi J, Sikka K, Gedeon T (2014) Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In: Proceedings of the 16th international conference on multimodal interaction. ACM, pp 461–466

  10. Dhall A, Ramana Murthy O, Goecke R, Joshi J, Gedeon T (2015) Video and image based emotion recognition challenges in the wild: Emotiw 2015. In: Proceedings of the 2015 international conference on multimodal interaction. ACM, pp 423–426

  11. Dhall A, Goecke R, Joshi J, Hoey J, Gedeon T (2016) Emotiw 2016: Video and group-level emotion recognition challenges. In: Proceedings of the 18th ACM international conference on multimodal interaction. ACM, pp 427–432

  12. Dhall A, Goecke R, Ghosh S, Joshi J, Hoey J, Gedeon T (2017) From individual to group-level emotion recognition: Emotiw 5.0. In: Proceedings of the 19th ACM international conference on multimodal interaction. ACM, pp 524–528

  13. Duda RO, Hart PE, Stork DG (1973) Pattern classification. Wiley, New York

  14. Ekman P, Friesen WV (2003) Unmasking the face: a guide to recognizing emotions from facial clues. Ishk, Los Altos

  15. Erdem CE, Turan C, Aydin Z (2015) Baum-2: a multilingual audio-visual affective face database. Multimed Tools Appl 74(18):7429–7459

  16. Gers FA, Schmidhuber J, Cummins F (1999) Learning to forget: Continual prediction with lstm. In: Proceedings ICANN 1999, 9th international conference on artificial neural networks. IET, pp 850–855

  17. Ghimire D, Jeong S, Lee J, Park SH (2017) Facial expression recognition based on local region specific features and support vector machines. Multimed Tools Appl 76(6):7803–7821

  18. Ghimire D, Lee J, Li ZN, Jeong S (2017) Recognition of facial expressions based on salient geometric features and support vector machines. Multimed Tools Appl 76(6):7921–7946

  19. Graves A (2012) Supervised sequence labelling with recurrent neural networks. Springer, Berlin

  20. Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Netw 18(5-6):602–610

  21. Graves A, Jaitly N, Mohamed A (2013) Hybrid speech recognition with deep bidirectional lstm. In: 2013 IEEE Workshop on automatic speech recognition and understanding (ASRU). IEEE, pp 273–278

  22. Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: Proceedings of the 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP 2013). IEEE, pp 6645–6649

  23. Han J, Zhang Z, Ringeval F, Schuller B (2017) Reconstruction-error-based learning for continuous emotion recognition in speech. In: Proceedings of the 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP 2017). IEEE, pp 2367–2371

  24. He L, Jiang D, Yang L, Pei E, Wu P, Sahli H (2015) Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 73–80

  25. Hernández-González J, Inza I, Lozano JA (2016) Weak supervision and other non-standard classification problems: a taxonomy. Pattern Recogn Lett 69:49–55

  26. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  27. Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the 2017 IEEE international conference on computer vision and pattern recognition (CVPR). IEEE, pp 2261–2269

  28. Kaya H, Çilli F, Salah AA (2014) Ensemble cca for continuous emotion prediction. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 19–26

  29. Le D, Aldeneh Z, Provost EM (2017) Discretized continuous speech emotion recognition with multi-task deep recurrent neural network. In: Proceedings of the 17th annual conference of the international speech communication association (INTERSPEECH 2017)

  30. Lisetti C (1998) Affective computing. Pattern Anal Applic 1(1):71–73

  31. Mansoorizadeh M, Charkari NM (2010) Multimodal information fusion application to human emotion recognition from face and speech. Multimed Tools Appl 49(2):277–297

  32. Mathieu B, Essid S, Fillon T, Prado J, Richard G (2010) Yaafe, an easy to use and efficient audio feature extraction software. In: Proceedings of the 11th international society for music information retrieval conference (ISMIR 2010), pp 441–446

  33. Nguyen MH, Torresani L, De La Torre F, Rother C (2009) Weakly supervised discriminative localization and classification: a joint learning process. In: Proceedings of the 12th international conference on computer vision (ICCV 2009). IEEE, pp 1925–1932

  34. Nicolaou MA, Gunes H, Pantic M (2010) Automatic segmentation of spontaneous data using dimensional labels from multiple coders. In: Proceedings of LREC int. workshop on multimodal corpora: advances in capturing, coding and analyzing multimodality. Citeseer, pp 43–48

  35. Nicolaou MA, Gunes H, Pantic M (2011) Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans Affect Comput 2(2):92–105

  36. Nicolle J, Rapp V, Bailly K, Prevost L, Chetouani M (2012) Robust continuous prediction of human emotions using multiscale dynamic cues. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 501–508

  37. Ozkan D, Scherer S, Morency LP (2012) Step-wise emotion recognition using concatenated-hmm. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 477–484

  38. Pei E, Yang L, Jiang D, Sahli H (2015) Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks. In: Proceedings of the 2015 international conference on affective computing and intelligent interaction (ACII 2015). IEEE, pp 208–214

  39. Povolny F, Matejka P, Hradis M, Popková A, Otrusina L, Smrz P, Wood I, Robin C, Lamel L (2016) Multimodal emotion recognition for avec 2016 challenge. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 75–82

  40. Prenter PM, et al. (2008) Splines and variational methods. Courier Corporation, Chelmsford

  41. Pudil P, Novovičová J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125

  42. Ringeval F, Sonderegger A, Sauer J, Lalanne D (2013) Introducing the recola multimodal corpus of remote collaborative and affective interactions. In: Proceedings of the 10th IEEE international conference and workshops on automatic face and gesture recognition (FG 2013). IEEE, pp 1–8

  43. Ringeval F, Schuller B, Valstar M, Cowie R, Pantic M (2015) Avec 2015: The 5th international audio/visual emotion challenge and workshop. In: Proceedings of the 23rd ACM international conference on multimedia. ACM, pp 1335–1336

  44. Ringeval F, Schuller B, Valstar M, Jaiswal S, Marchi E, Lalanne D, Cowie R, Pantic M (2015) AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 3–8

  45. Ringeval F, Schuller B, Valstar M, Gratch J, Cowie R, Scherer S, Mozgai S, Cummins N, Schmitt M, Pantic M (2017) Avec 2017: Real-life depression, and affect recognition workshop and challenge. In: Proceedings of the 7th annual workshop on audio/visual emotion challenge. ACM, pp 3–9

  46. Russell JA (1980) A circumplex model of affect. J Pers Soc Psychol 39(6):1161

  47. Schuller B, Valster M, Eyben F, Cowie R, Pantic M (2012) Avec 2012: the continuous audio/visual emotion challenge. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 449–456

  48. Schuller B, Steidl S, Batliner A, Vinciarelli A, Scherer K, Ringeval F, Chetouani M, Weninger F, Eyben F, Marchi E, et al. (2013) The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In: Proceedings of the 14th annual conference of the international speech communication association (INTERSPEECH 2013)

  49. Siddiqi MH, Ali R, Idris M, Khan AM, Kim ES, Whang MC, Lee S (2016) Human facial expression recognition using curvelet feature extraction and normalized mutual information feature selection. Multimed Tools Appl 75(2):935–959

  50. Sidorov M, Minker W (2014) Emotion recognition and depression diagnosis by acoustic and visual features: a multimodal approach. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 81–86

  51. Somandepalli K, Gupta R, Nasir M, Booth BM, Lee S, Narayanan SS (2016) Online affect tracking with multimodal kalman filters. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 59–66

  52. Sun B, Cao S, Li L, He J, Yu L (2016) Exploring multimodal visual features for continuous affect recognition. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 83–88

  53. Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Nicolaou MA, Schuller B, Zafeiriou S (2016) Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In: Proceedings of the 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP 2016). IEEE, pp 5200–5204

  54. Valstar MF, Jiang B, Mehu M, Pantic M, Scherer K (2011) The first facial expression recognition and analysis challenge. In: Proceedings of the 2011 IEEE international conference on automatic face & gesture recognition and workshops (FG 2011). IEEE, pp 921–926

  55. Valstar M, Schuller B, Smith K, Eyben F, Jiang B, Bilakhia S, Schnieder S, Cowie R, Pantic M (2013) Avec 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge. ACM, pp 3–10

  56. Valstar M, Schuller B, Smith K, Almaev T, Eyben F, Krajewski J, Cowie R, Pantic M (2014) Avec 2014: 3d dimensional affect and depression recognition challenge. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 3–10

  57. Valstar MF, Almaev T, Girard JM, McKeown G, Mehu M, Yin L, Pantic M, Cohn JF (2015) Fera 2015 - second facial expression recognition and analysis challenge. In: Proceedings of the 11th IEEE international conference and workshops on automatic face and gesture recognition (FG 2015), vol 6. IEEE, pp 1–8

  58. Valstar M, Gratch J, Schuller B, Ringeval F, Lalanne D, Torres Torres M, Scherer S, Stratou G, Cowie R, Pantic M (2016) Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 3–10

  59. Valstar MF, Sánchez-Lozano E, Cohn JF, Jeni LA, Girard JM, Zhang Z, Yin L, Pantic M (2017) Fera 2017 - addressing head pose in the third facial expression recognition and analysis challenge. In: Proceedings of the 12th IEEE international conference on automatic face & gesture recognition (FG 2017). IEEE, pp 839–847

  60. Van Der Maaten L (2012) Audio-visual emotion challenge 2012: a simple approach. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 473–476

  61. Verma GK, Tiwary US (2017) Affect representation and recognition in 3d continuous valence–arousal–dominance space. Multimed Tools Appl 76(2):2159–2183

  62. Ververidis D, Kotropoulos C (2006) Fast sequential floating forward selection applied to emotional speech features estimated on des and susas data collections. In: Proceedings of the 14th european signal processing conference. IEEE, pp 1–5

  63. Wang F, Sahli H, Gao J, Jiang D, Verhelst W (2015) Relevance units machine based dimensional and continuous speech emotion prediction. Multimed Tools Appl 74(22):9983–10000

  64. Weninger F, Geiger J, Wöllmer M, Schuller B, Rigoll G (2014) Feature enhancement by deep lstm networks for asr in reverberant multisource environments. Comput Speech Lang 28(4):888–902

  65. Weninger F, Bergmann J, Schuller BW (2015) Introducing currennt: the munich open-source cuda recurrent neural network toolkit. J Mach Learn Res 16(3):547–551

  66. Weninger F, Ringeval F, Marchi E, Schuller B (2016) Discriminatively trained recurrent neural networks for continuous dimensional emotion recognition from audio. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence. AAAI Press, pp 2196–2202

  67. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560

  68. Williams RJ, Zipser D (1995) Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications 1:433–486

  69. Wöllmer M, Eyben F, Reiter S, Schuller B, Cox C, Douglas-Cowie E, Cowie R (2008) Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies. In: Proceedings of the ninth annual conference of the international speech communication association (INTERSPEECH 2008), pp 597–600

  70. Wollmer M, Schuller B, Eyben F, Rigoll G (2010) Combining long short-term memory and dynamic bayesian networks for incremental emotion-sensitive artificial listening. IEEE J Sel Top Sign Proces 4(5):867–881

  71. Wöllmer M, Kaiser M, Eyben F, Schuller B, Rigoll G (2013) Lstm-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis Comput 31(2):153–163

  72. Zhang Z, Ringeval F, Han J, Deng J, Marchi E, Schuller B (2016) Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with lstm neural networks. In: Proceedings of the 17th annual conference of the international speech communication association (INTERSPEECH 2016), pp 3593–3597

Acknowledgements

This work is supported by the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the Chinese Scholarship Council (CSC) (grant 201706290115), and the VUB Interdisciplinary Research Program through the EMO-App project.

Author information

Corresponding author

Correspondence to Ercheng Pei.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pei, E., Jiang, D., Alioscha-Perez, M. et al. Continuous affect recognition with weakly supervised learning. Multimed Tools Appl 78, 19387–19412 (2019). https://doi.org/10.1007/s11042-019-7313-1
