ISCA Archive Interspeech 2022

Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection

Hira Dhamyal, Bhiksha Raj, Rita Singh

Emotion detection from a single modality, such as an audio or text stream, is known to be a challenging task. While encouraging results have been obtained by using joint evidence from multiple streams, combining such evidence in optimal ways remains an open challenge. In this paper, we claim that although modalities such as audio, phoneme-id sequences, and word-id sequences are related to each other, each also has its own local 'cadence', which is important to model for the task of emotion recognition. We model this local cadence by using separate 'positional encodings' for each modality in a transformer architecture. Our results show that emotion detection based on this strategy outperforms approaches in which the modality-specific cadence is ignored or normalized out by a shared positional encoding. We also find that capturing the interdependence between modalities is not as important as capturing the local cadence of each individual modality. We conduct our experiments on the IEMOCAP and CMU-MOSI datasets to demonstrate the effectiveness of the proposed methodology for combining multi-modal evidence.
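To make the core idea concrete, the following is a minimal PyTorch sketch of per-modality positional encodings: each stream (audio frames, phoneme ids, word ids) receives its own learnable positional encoding before the streams are passed to a shared transformer encoder. All module names, dimensions, and the fusion-by-concatenation step are illustrative assumptions for this sketch, not the authors' exact architecture.

import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Embeds one discrete modality and adds its own positional encoding."""

    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learnable positional encoding, private to this modality,
        # so its local cadence is not normalized out by a shared table.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))

    def forward(self, ids):                      # ids: (batch, seq_len)
        x = self.embed(ids)                      # (batch, seq_len, d_model)
        return x + self.pos[:, : x.size(1)]      # add this modality's positions

class MultimodalEmotionModel(nn.Module):
    # Hypothetical sizes; the paper's actual hyperparameters may differ.
    def __init__(self, phone_vocab=100, word_vocab=10000,
                 d_model=256, max_len=512, num_classes=4):
        super().__init__()
        self.phone_enc = ModalityEncoder(phone_vocab, d_model, max_len)
        self.word_enc = ModalityEncoder(word_vocab, d_model, max_len)
        # Audio arrives as continuous frames: project, then add its own positions.
        self.audio_proj = nn.Linear(40, d_model)   # e.g. 40-dim filterbank frames
        self.audio_pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio, phone_ids, word_ids):
        a = self.audio_proj(audio) + self.audio_pos[:, : audio.size(1)]
        p = self.phone_enc(phone_ids)
        w = self.word_enc(word_ids)
        # Concatenate along time; each span keeps its own cadence signal.
        x = self.encoder(torch.cat([a, p, w], dim=1))
        return self.classifier(x.mean(dim=1))      # pooled emotion logits

model = MultimodalEmotionModel()
logits = model(torch.randn(2, 300, 40),            # audio frames
               torch.randint(0, 100, (2, 60)),     # phoneme ids
               torch.randint(0, 10000, (2, 20)))   # word ids
print(logits.shape)  # torch.Size([2, 4])

Replacing the three separate positional tables with a single shared one reproduces the baseline the abstract argues against, which is a convenient ablation for testing the claim.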


doi: 10.21437/Interspeech.2022-11085

Cite as: Dhamyal, H., Raj, B., Singh, R. (2022) Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection. Proc. Interspeech 2022, 166-170, doi: 10.21437/Interspeech.2022-11085

@inproceedings{dhamyal22_interspeech,
  author={Hira Dhamyal and Bhiksha Raj and Rita Singh},
  title={{Positional Encoding for Capturing Modality Specific Cadence for Emotion Detection}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={166--170},
  doi={10.21437/Interspeech.2022-11085}
}