ABSTRACT
My Ph.D. research addresses the multimodal representation and assessment of speakers' emotional fluctuations in call center conversations. Emotion detection in human conversations has attracted increasing attention from researchers over the last three decades. Machine learning models have progressed from detecting six basic emotions to recognizing subtler, complex dimensional emotions, with promising results. In real-life use cases, however, the complexity of the data and the cost of human annotation remain challenging. My research therefore focuses on real-life conversations, addressing real-life data processing, emotional data annotation, and the design of multimodal emotion recognition systems, with the goal of building robust and ethical automatic emotion recognition systems.