DOI: 10.1145/3266302.3266304

Multimodal Continuous Emotion Recognition with Data Augmentation Using Recurrent Neural Networks

Published: 15 October 2018

ABSTRACT

This paper presents our contribution to the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2018, whose goal is to predict the levels of three emotional dimensions time-continuously in a cross-cultural setup. We extract emotional features from the audio, visual and textual modalities. As the regressor for continuous emotion recognition we use the state-of-the-art long short-term memory recurrent neural network (LSTM-RNN). We augment the training data by replacing each original training sample with shorter overlapping segments extracted from it, which multiplies the number of training samples and also benefits the training of the temporal emotion model with the LSTM-RNN. In addition, two strategies are explored to reduce the influence of the interlocutor and thereby improve performance. We also compare feature-level fusion with decision-level fusion. The experimental results show the effectiveness of the proposed method, and competitive results are obtained.
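As a rough illustration of the overlapping-segment augmentation described above, the sketch below splits one frame-level feature sequence and its continuous label trace into shorter overlapping windows. The window length, hop size, feature dimension and the NumPy-based helper are illustrative assumptions, not the authors' released code.

# Minimal sketch (assumed parameters, not the authors' implementation) of
# overlapping-window data augmentation: each original training sequence is
# replaced by several shorter, overlapping sub-sequences, multiplying the
# number of training samples available to an LSTM-RNN regressor.
import numpy as np

def augment_with_overlapping_windows(features, labels, win_len=300, hop=100):
    """Split one (T, D) feature sequence and its (T,) frame-level labels
    into shorter overlapping (win_len, D) segments."""
    segments = []
    for start in range(0, max(len(features) - win_len + 1, 1), hop):
        end = start + win_len
        if end > len(features):
            break
        segments.append((features[start:end], labels[start:end]))
    return segments

# Toy example: one sequence of 1000 frames with 88-dimensional features
# (e.g. an eGeMAPS-like acoustic descriptor) and one continuous label per frame.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 88)).astype(np.float32)
y = rng.uniform(-1, 1, size=1000).astype(np.float32)   # e.g. an arousal trace

augmented = augment_with_overlapping_windows(X, y)
print(f"1 original sequence -> {len(augmented)} overlapping training samples")

Each resulting segment can then be treated as an independent training sample for a many-to-many sequence regressor, such as the LSTM-RNN used here, that predicts the per-frame values of the emotional dimensions.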



Published in

      AVEC'18: Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop
      October 2018
      113 pages
ISBN: 9781450359832
DOI: 10.1145/3266302

      Copyright © 2018 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article

      Acceptance Rates

AVEC'18 paper acceptance rate: 11 of 23 submissions, 48%. Overall acceptance rate: 52 of 98 submissions, 53%.

