Automatic Recognition of Speaker’s Emotional States Based on Audio and Text

Date

2022-08-12

Publication Type

Dissertation

Abstract

Speech emotion recognition is a subfield of computational paralinguistics that analyzes emotional manifestations using acoustic and linguistic properties of speech signals. With the rising demand for human-centric technologies, automatic speech emotion recognition has become a topical issue addressed by many researchers around the globe, with practical applications ranging from everyday services in medicine or enterprise to fully capable artificial intelligence such as conversational agents. Emotion recognition is a multidisciplinary research field that draws on signal processing, statistics, and machine learning on the one hand, and on psychology, sociology, and linguistics on the other. The focus of this thesis is on the technical side of emotion modeling; however, we also put great emphasis on the theoretical concepts of emotions and the various approaches to their description. Many aspects of emotion recognition need to be addressed during the research and development stage. This thesis focuses on four of them: modeling emotions in a dialogue, domain adaptation, compact feature representation, and interpretability of the obtained results. All of these aspects matter for the practical applicability of a system in real-life scenarios. The conditions in which a system is designed to operate in real life often differ considerably from the laboratory environment: background noise, recording equipment, and the computational power available in end products may not match. These differences cannot be ignored because they significantly influence overall system performance. Therefore, the experiments conducted in this thesis are designed to address implementation issues commonly encountered during system deployment.
To address various aspects of emotion recognition, we need a comprehensive dataset that sufficiently represents both the target phenomenon and the target population for which the system is intended. In practice, however, the datasets available for research are often limited in size, number of speakers, situational context, and emotion annotation. This makes it necessary to consider several datasets to address different aspects of speech emotion recognition, which is also the case in the present thesis. We use four emotional speech corpora, namely IEMOCAP, CreativeIT, RAMAS, and USoMS-e, each serving a different goal. Three of these datasets were created with actors engaged in a dialogue, and one features personal narratives told by real people. The careful choice of data allows us to concentrate on important properties of emotions while reaching the goals of the thesis. The contributions of this thesis are manifold. First, we propose a new linguistic feature representation based on a combination of knowledge-based systems and machine learning algorithms. Second, we introduce a new hierarchical neural network architecture that is capable of modeling context on both the frame level and the dialogue level. Third, we design a method for increasing the training data size through an effective domain adaptation approach, which can also be used to test the system in a cross-corpus setup. Fourth, we develop a robust bimodal fusion of the acoustic and linguistic properties of a signal to mitigate the effects of missing modalities. The main contributions of the thesis thus span several levels: the feature level, the model architecture level, and the decision-making level.

Faculties

Fakultät für Ingenieurwissenschaften, Informatik und Psychologie

Institutions

Institut für Nachrichtentechnik

License

CC BY 4.0 International

Keywords

Affective computing, Computational paralinguistics, Deep learning, Machine learning, DDC 000 / Computer science, information & general works, DDC 410 / Linguistics