Continuous affect recognition with weakly supervised learning

Published in: Multimedia Tools and Applications

Abstract

Recognizing a person’s affective state from audio-visual signals is an essential capability for intelligent interaction. Insufficient training data and unreliable labels of the affective dimensions (e.g., valence and arousal) are two major challenges in continuous affect recognition. In this paper, we propose a weakly supervised learning approach based on a hybrid deep neural network and bidirectional long short-term memory recurrent neural network (DNN-BLSTM). It first maps the audio/visual features into a more discriminative space through the DNN, then models the temporal dynamics of affect through the BLSTM. To reduce the negative impact of the unreliable labels, we employ a temporal label (TL) together with a robust loss function (RL) to incorporate weak supervision into the learning process of the DNN-BLSTM model. As a result, the proposed method not only has a simpler structure than the deep BLSTM model of He et al. (24), which requires more training data, but is also robust to noisy and unreliable labels. Single-modal and multimodal affect recognition experiments were carried out on the RECOLA dataset. The single-modal results show that the proposed method with TL and RL yields notable improvements on both arousal and valence in terms of the concordance correlation coefficient (CCC), while the multimodal results show that, with fewer feature streams, the proposed approach achieves results better than or comparable to state-of-the-art methods.
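
The concordance correlation coefficient (CCC) used as the evaluation measure combines Pearson correlation with penalties for mean and variance differences: CCC = 2·cov(x, y) / (σ_x² + σ_y² + (μ_x − μ_y)²). The following PyTorch sketch, which is not the authors' implementation and uses illustrative layer sizes and names, shows how such a CCC-based objective could be paired with a hybrid DNN-BLSTM regressor of the kind described above.

    import torch
    import torch.nn as nn

    def ccc(pred, gold):
        # Concordance correlation coefficient between two 1-D tensors.
        pred_mean, gold_mean = pred.mean(), gold.mean()
        pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
        cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
        return 2 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)

    class DNNBLSTM(nn.Module):
        # Illustrative layer sizes; the paper's actual configuration may differ.
        def __init__(self, feat_dim, dnn_dim=128, lstm_dim=64):
            super().__init__()
            # DNN front-end: maps frame-level features into a more discriminative space.
            self.dnn = nn.Sequential(
                nn.Linear(feat_dim, dnn_dim), nn.ReLU(),
                nn.Linear(dnn_dim, dnn_dim), nn.ReLU(),
            )
            # BLSTM back-end: models the temporal dynamics of affect.
            self.blstm = nn.LSTM(dnn_dim, lstm_dim, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * lstm_dim, 1)  # per-frame valence or arousal

        def forward(self, x):  # x: (batch, time, feat_dim)
            h, _ = self.blstm(self.dnn(x))
            return self.out(h).squeeze(-1)  # (batch, time)

    # Training would typically minimise 1 - ccc(prediction, label) per sequence.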

References

  1. Baltrušaitis T, Banda N, Robinson P (2013) Dimensional affect recognition using continuous conditional random fields. In: Proceedings of the 10th IEEE international conference and workshops on automatic face and gesture recognition (FG 2013). IEEE, pp 1–8

  2. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, London

  3. Brady K, Gwon Y, Khorrami P, Godoy E, Campbell W, Dagli C, Huang TS (2016) Multi-modal audio, video and physiological sensor learning for continuous emotion prediction. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 97–104

  4. Chao L, Tao J, Yang M, Li Y, Wen Z (2014) Multi-scale temporal modeling for dimensional emotion recognition in video. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 11–18

  5. Chao L, Tao J, Yang M, Li Y, Wen Z (2015) Long short term memory recurrent neural network based multimodal dimensional emotion recognition. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 65–72

  6. Chen S, Jin Q (2015) Multi-modal dimensional emotion recognition using recurrent neural networks. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 49–56

  7. Chen S, Jin Q, Zhao J, Wang S (2017) Multimodal multi-task learning for dimensional and continuous emotion recognition. In: Proceedings of the 7th annual workshop on audio/visual emotion challenge. ACM, pp 19–26

  8. Dhall A, Goecke R, Joshi J, Wagner M, Gedeon T (2013) Emotion recognition in the wild challenge 2013. In: Proceedings of the 15th ACM on International conference on multimodal interaction. ACM, pp 509–516

  9. Dhall A, Goecke R, Joshi J, Sikka K, Gedeon T (2014) Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In: Proceedings of the 16th international conference on multimodal interaction. ACM, pp 461–466

  10. Dhall A, Ramana Murthy O, Goecke R, Joshi J, Gedeon T (2015) Video and image based emotion recognition challenges in the wild: Emotiw 2015. In: Proceedings of the 2015 international conference on multimodal interaction. ACM, pp 423–426

  11. Dhall A, Goecke R, Joshi J, Hoey J, Gedeon T (2016) Emotiw 2016: Video and group-level emotion recognition challenges. In: Proceedings of the 18th ACM international conference on multimodal interaction. ACM, pp 427–432

  12. Dhall A, Goecke R, Ghosh S, Joshi J, Hoey J, Gedeon T (2017) From individual to group-level emotion recognition: Emotiw 5.0. In: Proceedings of the 19th ACM international conference on multimodal interaction. ACM, pp 524–528

  13. Duda RO, Hart PE, Stork DG (1973) Pattern classification. Wiley, New York

  14. Ekman P, Friesen WV (2003) Unmasking the face: a guide to recognizing emotions from facial clues. Ishk, Los Altos

  15. Erdem CE, Turan C, Aydin Z (2015) Baum-2: a multilingual audio-visual affective face database. Multimed Tools Appl 74(18):7429–7459

  16. Gers FA, Schmidhuber J, Cummins F (1999) Learning to forget: Continual prediction with lstm. In: Proceedings ICANN 1999, 9th international conference on artificial neural networks. IET, pp 850–855

  17. Ghimire D, Jeong S, Lee J, Park SH (2017) Facial expression recognition based on local region specific features and support vector machines. Multimed Tools Appl 76(6):7803–7821

  18. Ghimire D, Lee J, Li ZN, Jeong S (2017) Recognition of facial expressions based on salient geometric features and support vector machines. Multimed Tools Appl 76(6):7921–7946

  19. Graves A (2012) Supervised sequence labelling with recurrent neural networks. Springer, Berlin

  20. Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Netw 18(5-6):602–610

  21. Graves A, Jaitly N, Mohamed A (2013) Hybrid speech recognition with deep bidirectional lstm. In: 2013 IEEE Workshop on automatic speech recognition and understanding (ASRU). IEEE, pp 273–278

  22. Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: Proceedings of the 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP 2013). IEEE, pp 6645–6649

  23. Han J, Zhang Z, Ringeval F, Schuller B (2017) Reconstruction-error-based learning for continuous emotion recognition in speech. In: Proceedings of the 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP 2017). IEEE, pp 2367–2371

  24. He L, Jiang D, Yang L, Pei E, Wu P, Sahli H (2015) Multimodal affective dimension prediction using deep bidirectional long short-term memory recurrent neural networks. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 73–80

  25. Hernández-González J, Inza I, Lozano JA (2016) Weak supervision and other non-standard classification problems: a taxonomy. Pattern Recogn Lett 69:49–55

  26. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  27. Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the 2017 IEEE international conference on computer vision and pattern recognition (CVPR). IEEE, pp 2261–2269

  28. Kaya H, Çilli F, Salah AA (2014) Ensemble cca for continuous emotion prediction. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 19–26

  29. Le D, Aldeneh Z, Provost EM (2017) Discretized continuous speech emotion recognition with multi-task deep recurrent neural network. In: Proceedings of the 17th annual conference of the international speech communication association (INTERSPEECH 2017)

  30. Lisetti C (1998) Affective computing. Pattern Anal Applic 1(1):71–73

  31. Mansoorizadeh M, Charkari NM (2010) Multimodal information fusion application to human emotion recognition from face and speech. Multimed Tools Appl 49(2):277–297

  32. Mathieu B, Essid S, Fillon T, Prado J, Richard G (2010) Yaafe, an easy to use and efficient audio feature extraction software. In: Proceedings of the 11th international society for music information retrieval conference (ISMIR 2010), pp 441–446

  33. Nguyen MH, Torresani L, De La Torre F, Rother C (2009) Weakly supervised discriminative localization and classification: a joint learning process. In: Proceedings of the 12th international conference on computer vision (ICCV 2009). IEEE, pp 1925–1932

  34. Nicolaou MA, Gunes H, Pantic M (2010) Automatic segmentation of spontaneous data using dimensional labels from multiple coders. In: Proceedings of LREC int. workshop on multimodal corpora: advances in capturing, coding and analyzing multimodality. Citeseer, pp 43–48

  35. Nicolaou MA, Gunes H, Pantic M (2011) Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans Affect Comput 2(2):92–105

  36. Nicolle J, Rapp V, Bailly K, Prevost L, Chetouani M (2012) Robust continuous prediction of human emotions using multiscale dynamic cues. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 501–508

  37. Ozkan D, Scherer S, Morency LP (2012) Step-wise emotion recognition using concatenated-hmm. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 477–484

  38. Pei E, Yang L, Jiang D, Sahli H (2015) Multimodal dimensional affect recognition using deep bidirectional long short-term memory recurrent neural networks. In: Proceedings of the 2015 international conference on affective computing and intelligent interaction (ACII 2015). IEEE, pp 208–214

  39. Povolny F, Matejka P, Hradis M, Popková A, Otrusina L, Smrz P, Wood I, Robin C, Lamel L (2016) Multimodal emotion recognition for avec 2016 challenge. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 75–82

  40. Prenter PM, et al. (2008) Splines and variational methods. Courier Corporation, Chelmsford

  41. Pudil P, Novovičová J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125

  42. Ringeval F, Sonderegger A, Sauer J, Lalanne D (2013) Introducing the recola multimodal corpus of remote collaborative and affective interactions. In: Proceedings of the 10th IEEE international conference and workshops on automatic face and gesture recognition (FG 2013). IEEE, pp 1–8

  43. Ringeval F, Schuller B, Valstar M, Cowie R, Pantic M (2015) Avec 2015: The 5th international audio/visual emotion challenge and workshop. In: Proceedings of the 23rd ACM international conference on multimedia. ACM, pp 1335–1336

  44. Ringeval F, Schuller B, Valstar M, Jaiswal S, Marchi E, Lalanne D, Cowie R, Pantic M (2015) AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th international workshop on audio/visual emotion challenge. ACM, pp 3–8

  45. Ringeval F, Schuller B, Valstar M, Gratch J, Cowie R, Scherer S, Mozgai S, Cummins N, Schmitt M, Pantic M (2017) Avec 2017: Real-life depression, and affect recognition workshop and challenge. In: Proceedings of the 7th annual workshop on audio/visual emotion challenge. ACM, pp 3–9

  46. Russell JA (1980) A circumplex model of affect. J Pers Soc Psychol 39(6):1161

  47. Schuller B, Valster M, Eyben F, Cowie R, Pantic M (2012) Avec 2012: the continuous audio/visual emotion challenge. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 449–456

  48. Schuller B, Steidl S, Batliner A, Vinciarelli A, Scherer K, Ringeval F, Chetouani M, Weninger F, Eyben F, Marchi E, et al. (2013) The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In: Proceedings of the 14th annual conference of the international speech communication association (INTERSPEECH 2013)

  49. Siddiqi MH, Ali R, Idris M, Khan AM, Kim ES, Whang MC, Lee S (2016) Human facial expression recognition using curvelet feature extraction and normalized mutual information feature selection. Multimed Tools Appl 75(2):935–959

  50. Sidorov M, Minker W (2014) Emotion recognition and depression diagnosis by acoustic and visual features: a multimodal approach. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 81–86

  51. Somandepalli K, Gupta R, Nasir M, Booth BM, Lee S, Narayanan SS (2016) Online affect tracking with multimodal kalman filters. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 59–66

  52. Sun B, Cao S, Li L, He J, Yu L (2016) Exploring multimodal visual features for continuous affect recognition. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 83–88

  53. Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Nicolaou MA, Schuller B, Zafeiriou S (2016) Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In: Proceedings of the 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP 2016). IEEE, pp 5200–5204

  54. Valstar MF, Jiang B, Mehu M, Pantic M, Scherer K (2011) The first facial expression recognition and analysis challenge. In: Proceedings of the 2011 IEEE international conference on automatic face & gesture recognition and workshops (FG 2011). IEEE, pp 921–926

  55. Valstar M, Schuller B, Smith K, Eyben F, Jiang B, Bilakhia S, Schnieder S, Cowie R, Pantic M (2013) Avec 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge. ACM, pp 3–10

  56. Valstar M, Schuller B, Smith K, Almaev T, Eyben F, Krajewski J, Cowie R, Pantic M (2014) Avec 2014: 3d dimensional affect and depression recognition challenge. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. ACM, pp 3–10

  57. Valstar MF, Almaev T, Girard JM, McKeown G, Mehu M, Yin L, Pantic M, Cohn JF (2015) Fera 2015 - second facial expression recognition and analysis challenge. In: Proceedings of the 11th IEEE international conference and workshops on automatic face and gesture recognition (FG 2015), vol 6. IEEE, pp 1–8

  58. Valstar M, Gratch J, Schuller B, Ringeval F, Lalanne D, Torres Torres M, Scherer S, Stratou G, Cowie R, Pantic M (2016) Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. ACM, pp 3–10

  59. Valstar MF, Sánchez-Lozano E, Cohn JF, Jeni LA, Girard JM, Zhang Z, Yin L, Pantic M (2017) Fera 2017 - addressing head pose in the third facial expression recognition and analysis challenge. In: Proceedings of the 12th IEEE international conference on automatic face & gesture recognition (FG 2017). IEEE, pp 839–847

  60. Van Der Maaten L (2012) Audio-visual emotion challenge 2012: a simple approach. In: Proceedings of the 14th ACM international conference on multimodal interaction. ACM, pp 473–476

  61. Verma GK, Tiwary US (2017) Affect representation and recognition in 3d continuous valence–arousal–dominance space. Multimed Tools Appl 76(2):2159–2183

  62. Ververidis D, Kotropoulos C (2006) Fast sequential floating forward selection applied to emotional speech features estimated on des and susas data collections. In: Proceedings of the 14th european signal processing conference. IEEE, pp 1–5

  63. Wang F, Sahli H, Gao J, Jiang D, Verhelst W (2015) Relevance units machine based dimensional and continuous speech emotion prediction. Multimed Tools Appl 74(22):9983–10000

  64. Weninger F, Geiger J, Wöllmer M, Schuller B, Rigoll G (2014) Feature enhancement by deep lstm networks for asr in reverberant multisource environments. Comput Speech Lang 28(4):888–902

  65. Weninger F, Bergmann J, Schuller BW (2015) Introducing currennt: the munich open-source cuda recurrent neural network toolkit. J Mach Learn Res 16(3):547–551

  66. Weninger F, Ringeval F, Marchi E, Schuller B (2016) Discriminatively trained recurrent neural networks for continuous dimensional emotion recognition from audio. In: Proceedings of the twenty-fifth international joint conference on artificial intelligence. AAAI Press, pp 2196–2202

  67. Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560

  68. Williams RJ, Zipser D (1995) Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications 1:433–486

  69. Wöllmer M, Eyben F, Reiter S, Schuller B, Cox C, Douglas-Cowie E, Cowie R (2008) Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies. In: Proceedings of the ninth annual conference of the international speech communication association (INTERSPEECH 2008), pp 597–600

  70. Wollmer M, Schuller B, Eyben F, Rigoll G (2010) Combining long short-term memory and dynamic bayesian networks for incremental emotion-sensitive artificial listening. IEEE J Sel Top Sign Proces 4(5):867–881

  71. Wöllmer M, Kaiser M, Eyben F, Schuller B, Rigoll G (2013) Lstm-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis Comput 31(2):153–163

  72. Zhang Z, Ringeval F, Han J, Deng J, Marchi E, Schuller B (2016) Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with lstm neural networks. In: Proceedings of the 17th annual conference of the international speech communication association (INTERSPEECH 2016), pp 3593–3597

Acknowledgements

This work is supported by the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the Chinese Scholarship Council (CSC) (grant 201706290115), and the VUB Interdisciplinary Research Program through the EMO-App project.

Author information

Corresponding author

Correspondence to Ercheng Pei.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pei, E., Jiang, D., Alioscha-Perez, M. et al. Continuous affect recognition with weakly supervised learning. Multimed Tools Appl 78, 19387–19412 (2019). https://doi.org/10.1007/s11042-019-7313-1
