Abstract
This work addresses emotion recognition in the wild using a broad set of audio, visual, and meta features. We propose a method that optimizes multi-modal fusion architectures by means of evolutionary computing. Extensive uni- and multi-modal experiments show the discriminative power of each computed feature set and fusion architecture. Furthermore, we summarize the EmotiW 2013 and 2014 challenges, review the conclusions drawn from them, and compare our results with the state of the art on this dataset.
Cite this article
Kächele, M., Schels, M., Meudt, S. et al. Revisiting the EmotiW challenge: how wild is it really? J Multimodal User Interfaces 10, 151–162 (2016). https://doi.org/10.1007/s12193-015-0202-7
DOI: https://doi.org/10.1007/s12193-015-0202-7