DOI: 10.1145/2808196.2811640

An Investigation of Annotation Delay Compensation and Output-Associative Fusion for Multimodal Continuous Emotion Prediction

Published: 26 October 2015

Abstract

Continuous emotion dimension prediction has increased in popularity over the last few years, as the shift away from discrete classification-based tasks has introduced more realism into emotion modeling. However, many questions remain, including how best to combine information from several modalities (e.g. audio, video). As part of the AV+EC 2015 Challenge, we investigate annotation delay compensation and propose a range of multimodal systems based on an output-associative fusion framework. The performance of the proposed systems is significantly higher than the challenge baseline, with the strongest system yielding 66.7% and 53.9% relative increases in prediction accuracy over the AV+EC 2015 test-set arousal and valence baselines, respectively. The results also demonstrate the importance of annotation delay compensation for continuous emotion analysis. Of particular interest is the output-associative fusion framework, which performed very well in a number of significantly different configurations, highlighting that incorporating both affective dimensional dependencies and temporal information is a promising research direction for predicting emotion dimensions.
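To make the two core ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of annotation delay compensation and output-associative fusion. It assumes frame-aligned features and labels at a fixed rate; ridge regression stands in for the SVR/RVM regressors named in the author tags, and all variable and function names are hypothetical.

import numpy as np
from sklearn.linear_model import Ridge

def ccc(y_true, y_pred):
    # Concordance correlation coefficient, the AV+EC 2015 evaluation metric.
    mt, mp = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mt) * (y_pred - mp))
    return 2 * cov / (y_true.var() + y_pred.var() + (mt - mp) ** 2)

def shift_labels(X, y, delay_frames):
    # Annotation delay compensation: annotators react late, so the label at
    # frame t + d describes the stimulus at frame t; pair X[t] with y[t + d].
    if delay_frames == 0:
        return X, y
    return X[:-delay_frames], y[delay_frames:]

def best_delay(X_tr, y_tr, X_dev, y_dev, rate_hz=25, max_delay_s=8):
    # Grid-search the label shift (1 s steps here) that maximizes dev-set CCC.
    best_d, best_score = 0, -np.inf
    for d in range(0, max_delay_s * rate_hz + 1, rate_hz):
        model = Ridge(alpha=1.0).fit(*shift_labels(X_tr, y_tr, d))
        Xd, yd = shift_labels(X_dev, y_dev, d)
        score = ccc(yd, model.predict(Xd))
        if score > best_score:
            best_d, best_score = d, score
    return best_d

def oa_features(pred_streams, win=10):
    # Output-associative fusion input: at each frame, stack a +/- win window
    # of EVERY first-stage prediction stream (arousal and valence, from all
    # modalities), so cross-dimension dependencies and temporal context both
    # feed the second-stage regressor.
    P = np.column_stack(pred_streams)      # (T, n_streams)
    T = P.shape[0]
    offsets = np.arange(-win, win + 1)[None, :]
    idx = np.clip(np.arange(T)[:, None] + offsets, 0, T - 1)
    return P[idx].reshape(T, -1)           # (T, (2*win+1) * n_streams)

# Hypothetical usage: a_audio, a_video, v_audio, v_video are per-frame
# first-stage predictions, so the arousal model also sees valence outputs.
# Z = oa_features([a_audio, a_video, v_audio, v_video], win=10)
# arousal_model = Ridge(alpha=1.0).fit(Z, y_arousal_shifted)

The 1 s search grid, the 8 s delay cap, and the size of the prediction window are illustrative settings under the stated assumptions, not values taken from the paper.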




Published In

AVEC '15: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge
October 2015
90 pages
ISBN:9781450337434
DOI:10.1145/2808196
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. annotation delay compensation
  2. emotion dimension prediction
  3. multimodal fusion
  4. output-associative fusion
  5. relevance vector machine
  6. support vector regression

Qualifiers

  • Research-article

Conference

MM '15: ACM Multimedia Conference
October 26, 2015
Brisbane, Australia

Acceptance Rates

AVEC '15 Paper Acceptance Rate: 9 of 15 submissions, 60%
Overall Acceptance Rate: 52 of 98 submissions, 53%


Article Metrics

  • Downloads (last 12 months): 26
  • Downloads (last 6 weeks): 1
Reflects downloads up to 20 Jan 2025


Cited By

  • (2024) Continuous Emotion Ambiguity Prediction: Modeling With Beta Distributions. IEEE Transactions on Affective Computing 15:3 (1684-1695). DOI: 10.1109/TAFFC.2024.3367371. Online publication date: Jul-2024.
  • (2024) Impact of Annotation Modality on Label Quality and Model Performance in the Automatic Assessment of Laughter In-the-Wild. IEEE Transactions on Affective Computing 15:2 (519-534). DOI: 10.1109/TAFFC.2023.3269003. Online publication date: Apr-2024.
  • (2024) A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos. Applied Intelligence 54:4 (3040-3057). DOI: 10.1007/s10489-024-05329-w. Online publication date: 1-Feb-2024.
  • (2023) Efficient Labelling of Affective Video Datasets via Few-Shot & Multi-Task Contrastive Learning. Proceedings of the 31st ACM International Conference on Multimedia (6161-6170). DOI: 10.1145/3581783.3613784. Online publication date: 26-Oct-2023.
  • (2023) Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention. IEEE Transactions on Biometrics, Behavior, and Identity Science 5:3 (360-373). DOI: 10.1109/TBIOM.2022.3233083. Online publication date: Jul-2023.
  • (2023) A Novel Markovian Framework for Integrating Absolute and Relative Ordinal Emotion Information. IEEE Transactions on Affective Computing 14:3 (2089-2101). DOI: 10.1109/TAFFC.2022.3159782. Online publication date: 1-Jul-2023.
  • (2023) Constrained Dynamical Neural ODE for Time Series Modelling: A Case Study on Continuous Emotion Prediction. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (1-5). DOI: 10.1109/ICASSP49357.2023.10095778. Online publication date: 4-Jun-2023.
  • (2022) Gaze-enhanced Crossmodal Embeddings for Emotion Recognition. Proceedings of the ACM on Human-Computer Interaction 6:ETRA (1-18). DOI: 10.1145/3530879. Online publication date: 13-May-2022.
  • (2022) Few-Shot Learning for Fine-Grained Emotion Recognition Using Physiological Signals. IEEE Transactions on Multimedia 25 (3773-3787). DOI: 10.1109/TMM.2022.3165715. Online publication date: 7-Apr-2022.
  • (2022) A Bayesian Filtering Framework for Continuous Affect Recognition From Facial Images. IEEE Transactions on Multimedia 25 (3709-3722). DOI: 10.1109/TMM.2022.3164248. Online publication date: 1-Apr-2022.
