ABSTRACT
My Ph.D. research addresses the multimodal representation and assessment of speakers' emotional fluctuations in call center conversations. Emotion detection in human conversations has attracted increasing attention from researchers over the last three decades. Machine learning models have progressed from detecting six basic emotions to recognizing subtler, complex dimensional emotions, with promising results. In real-life use cases, however, the complexity of the data and the cost of human annotation remain challenging. My research therefore focuses on real-life conversations, addressing real-life data processing, emotional data annotation, and the design of multimodal emotion recognition systems, with the goal of building robust and ethical automatic emotion recognition systems.