ISCA Archive Interspeech 2022

Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition

Chung-Soo Ahn, Chamara Kasun, Sunil Sivadas, Jagath Rajapakse

To infer emotions accurately from speech, fusion of audio and text is essential, as words carry most of the information about semantics and emotion. The attention mechanism is an essential component of multimodal fusion architectures, as it dynamically pairs different regions within multimodal sequences. However, existing architectures lack an explicit structure to model the dynamics between fused representations. We therefore propose recurrent multi-head attention in a fusion architecture, which selects salient fused representations and learns the dynamics between them. Multiple 2-D attention layers select salient pairs among all possible pairs of audio and text representations, which are combined with a fusion operation. Finally, the multiple fused representations are fed into a recurrent unit to learn the dynamics between them. Our method outperforms existing approaches to fusing audio and text for speech emotion recognition and achieves state-of-the-art accuracy on the benchmark IEMOCAP dataset.
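The abstract's pipeline (pair all audio and text representations, attend over the pair grid to select salient fused pairs, then model their dynamics with a recurrent unit) can be sketched roughly as follows. This is a minimal illustrative sketch assuming PyTorch, not the authors' code: the embedding sizes, the additive fusion operation, the use of a GRU, and the classifier head are all assumptions made for illustration.

# Hypothetical sketch of recurrent multi-head attention fusion;
# not the paper's implementation. Assumes PyTorch.
import torch
import torch.nn as nn

class RecurrentMHAFusion(nn.Module):
    def __init__(self, dim=128, heads=4, num_classes=4):
        super().__init__()
        # Multi-head attention over the flattened grid of all
        # (audio frame, text token) pairs stands in for the
        # abstract's 2-D attention that selects salient pairs.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Recurrent unit learns dynamics between fused representations.
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, audio, text):
        # audio: (B, Ta, dim), text: (B, Tt, dim)
        B, Ta, D = audio.shape
        Tt = text.shape[1]
        # Fuse every audio frame with every text token
        # (additive fusion is an assumption; any fusion op fits here).
        pairs = (audio.unsqueeze(2) + text.unsqueeze(1)).reshape(B, Ta * Tt, D)
        # Self-attention re-weights the pair grid toward salient pairs.
        fused, _ = self.attn(pairs, pairs, pairs)
        # Recurrent unit over the sequence of fused representations;
        # the final hidden state feeds an emotion classifier.
        _, h = self.gru(fused)
        return self.cls(h[-1])

model = RecurrentMHAFusion()
logits = model(torch.randn(2, 50, 128), torch.randn(2, 20, 128))  # (2, 4)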


doi: 10.21437/Interspeech.2022-888

Cite as: Ahn, C.-S., Kasun, C., Sivadas, S., Rajapakse, J. (2022) Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition. Proc. Interspeech 2022, 744-748, doi: 10.21437/Interspeech.2022-888

@inproceedings{ahn22b_interspeech,
  author={Chung-Soo Ahn and Chamara Kasun and Sunil Sivadas and Jagath Rajapakse},
  title={{Recurrent multi-head attention fusion network for combining audio and text for speech emotion recognition}},
  year={2022},
  booktitle={Proc. Interspeech 2022},
  pages={744--748},
  doi={10.21437/Interspeech.2022-888}
}