Human perception and understanding rely on a number of different and complementary
cues drawn from different modalities. The variety of emotional states expressed
in human communication is reflected in this range of cues across modalities.
Recent work on multi-modal emotion recognition uses deep-learning techniques
to achieve strong performance, with models built on features suited to text,
audio, and vision. This work focuses on cross-modal fusion techniques over
deep-learning models for emotion detection from spoken audio and the
corresponding transcripts.
We investigate the use of a long short-term memory (LSTM) recurrent neural
network (RNN) with pre-trained word embeddings for text-based emotion
recognition, and a convolutional neural network (CNN) with utterance-level
descriptors for emotion recognition from speech. Various fusion strategies
are applied to these models to yield an overall score for each emotional
category. Intra-modality dynamics for each emotion are captured in the neural
network designed for that modality, and fusion techniques are employed to
capture the inter-modality dynamics. Speaker- and session-independent
experiments on the IEMOCAP multi-modal emotion detection dataset show the
effectiveness of the proposed approaches. This method yields state-of-the-art
results for utterance-level emotion recognition based on speech and text.
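To illustrate the overall setup, the following is a minimal sketch of the two unimodal branches and a simple score-level fusion. It is written in PyTorch as an assumption (the abstract does not specify a framework), and all layer sizes, acoustic feature choices, and the fusion weight `alpha` are hypothetical placeholders rather than the configuration reported in the paper.

```python
# Hypothetical sketch: text LSTM branch, audio CNN branch, and late (score-level) fusion.
# All dimensions and the fusion weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4  # e.g. angry, happy, sad, neutral (a common IEMOCAP setup)

class TextLSTM(nn.Module):
    """Text branch: pre-trained word embeddings -> LSTM -> per-emotion scores."""
    def __init__(self, embedding_matrix, hidden=128):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape
        self.embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, token_ids):
        x = self.embed(token_ids)          # (batch, seq_len, emb_dim)
        _, (h, _) = self.lstm(x)           # final hidden state summarizes the utterance
        return self.out(h[-1])             # (batch, NUM_CLASSES) logits

class AudioCNN(nn.Module):
    """Audio branch: 1-D convolutions over frame-level acoustic features,
    pooled into an utterance-level descriptor."""
    def __init__(self, feat_dim=40, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.out = nn.Linear(channels, NUM_CLASSES)

    def forward(self, frames):
        x = self.conv(frames.transpose(1, 2))  # (batch, channels, time)
        x = x.mean(dim=2)                      # temporal pooling -> utterance-level vector
        return self.out(x)                     # (batch, NUM_CLASSES) logits

def late_fusion(text_logits, audio_logits, alpha=0.5):
    """One possible fusion strategy: weighted combination of per-class posteriors."""
    p_text = F.softmax(text_logits, dim=-1)
    p_audio = F.softmax(audio_logits, dim=-1)
    return alpha * p_text + (1.0 - alpha) * p_audio
```

In this sketch each branch produces its own per-emotion scores (intra-modality dynamics), and the fusion step combines them into a single score per category; other fusion strategies, such as concatenating the branch representations before a joint classifier, fit the same structure.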
Cite as: Sebastian, J., Pierucci, P. (2019) Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts. Proc. Interspeech 2019, 51-55, doi: 10.21437/Interspeech.2019-3201
@inproceedings{sebastian19_interspeech,
  author={Jilt Sebastian and Piero Pierucci},
  title={{Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={51--55},
  doi={10.21437/Interspeech.2019-3201}
}