Recognizing the emotional tone in spoken language is a challenging research problem that requires modeling not only the acoustic and textual modalities separately but also their cross-interactions. In this work, we introduce a hierarchical fusion scheme for sentiment analysis of spoken sentences. Two bidirectional Long Short-Term Memory networks (BiLSTM), each followed by multiple fully connected layers, are trained to extract feature representations for the textual and audio modalities. The representations of the unimodal encoders are fused at each layer and propagated forward, thus achieving fusion at the word, sentence, and high (sentiment) levels. The proposed deep hierarchical fusion approach achieves state-of-the-art results for sentiment analysis tasks. Through an ablation study, we show that the proposed fusion method achieves greater performance gains over the unimodal baseline than other fusion approaches in the literature.
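The hierarchical fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: simple per-step projections stand in for the BiLSTM encoders and dense stacks, mean pooling stands in for the sentence summarization, and all dimensions and the 3-class output are hypothetical choices for the sketch. The intent is only to show how word-level, sentence-level, and high-level fusion compose.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    # Stand-in for a BiLSTM + fully connected stack:
    # a per-timestep nonlinear projection (hypothetical simplification).
    return np.tanh(x @ w)

T, d_text, d_audio, d_h = 5, 8, 6, 4          # hypothetical sizes
text = rng.normal(size=(T, d_text))            # word embeddings
audio = rng.normal(size=(T, d_audio))          # word-aligned acoustic features

W_t = rng.normal(size=(d_text, d_h))
W_a = rng.normal(size=(d_audio, d_h))
h_t = encode(text, W_t)                        # textual states, shape (T, d_h)
h_a = encode(audio, W_a)                       # acoustic states, shape (T, d_h)

# Word-level fusion: concatenate the per-word states of both
# modalities and project them to a joint space.
W_w = rng.normal(size=(2 * d_h, d_h))
f_word = np.tanh(np.concatenate([h_t, h_a], axis=1) @ W_w)   # (T, d_h)

# Sentence-level fusion: pool each stream over time, fuse the pooled
# unimodal vectors together with the propagated word-level fusion.
s = np.concatenate([h_t.mean(axis=0), h_a.mean(axis=0), f_word.mean(axis=0)])
W_s = rng.normal(size=(3 * d_h, d_h))
f_sent = np.tanh(s @ W_s)                      # fused sentence vector, (d_h,)

# High (sentiment) level: a classifier head on the fused representation.
W_c = rng.normal(size=(d_h, 3))                # 3 sentiment classes (hypothetical)
logits = f_sent @ W_c                          # shape (3,)
```

In a trained model, the projection matrices would be learned jointly end-to-end, so the fused representations at each level are optimized for the final sentiment prediction.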
Cite as: Georgiou, E., Papaioannou, C., Potamianos, A. (2019) Deep Hierarchical Fusion with Application in Sentiment Analysis. Proc. Interspeech 2019, 1646-1650, doi: 10.21437/Interspeech.2019-3243
@inproceedings{georgiou19_interspeech,
  author={Efthymios Georgiou and Charilaos Papaioannou and Alexandros Potamianos},
  title={{Deep Hierarchical Fusion with Application in Sentiment Analysis}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1646--1650},
  doi={10.21437/Interspeech.2019-3243}
}