ISCA Archive Interspeech 2022

Mind the gap: On the value of silence representations to lexical-based speech emotion recognition

Matthew Perez, Mimansa Jaiswal, Minxue Niu, Cristina Gorrostieta, Matthew Roddy, Kye Taylor, Reza Lotfian, John Kane, Emily Mower Provost

Speech timing and non-speech regions (here referred to as "silence") often play a critical role in the perception of spoken language. Silence is an important paralinguistic component of communication: its functions include conveying emphasis, dramatization, and even sarcasm. In speech emotion recognition (SER), there has been relatively little work investigating the utility of silence and no work regarding the effect of silence on linguistics. In this work, we present a novel framework that investigates fusing linguistic and silence representations for emotion recognition in naturalistic speech using the MSP-Podcast dataset. We investigate two methods to represent silence in SER models: the first uses utterance-level statistics, while the second learns a silence token embedding within a transformer language model. Our results show that modeling silence improves SER performance, and that modeling silence as a token in a transformer language model significantly improves performance on MSP-Podcast, achieving a concordance correlation coefficient of 0.191 and 0.453 for activation and valence, respectively. In addition, we perform analyses of the attention of silence and find that silence emphasizes the attention of its surrounding words.
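
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of the standard concordance correlation coefficient, written in plain Python. It is not taken from the paper: the function name `concordance_cc` and the use of population (divide-by-n) variance are assumptions made for illustration, and the authors' evaluation code may differ in such details.

```python
def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences.

    Illustrative sketch only (hypothetical helper, not the authors' code).
    Uses population statistics: CCC = 2*cov / (var_t + var_p + (mean_t - mean_p)^2).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n

    # Population variance and covariance (divide by n, not n - 1).
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denom
```

Unlike Pearson correlation, this quantity penalizes a constant offset between predictions and labels: identical sequences score 1, while predictions shifted by a fixed amount score below 1 even though they are perfectly correlated.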


doi: 10.21437/Interspeech.2022-10943

Cite as: Perez, M., Jaiswal, M., Niu, M., Gorrostieta, C., Roddy, M., Taylor, K., Lotfian, R., Kane, J., Provost, E.M. (2022) Mind the gap: On the value of silence representations to lexical-based speech emotion recognition. Proc. Interspeech 2022, 156-160, doi: 10.21437/Interspeech.2022-10943

@inproceedings{perez22_interspeech,
  author={Matthew Perez and Mimansa Jaiswal and Minxue Niu and Cristina Gorrostieta and Matthew Roddy and Kye Taylor and Reza Lotfian and John Kane and Emily Mower Provost},
  title={{Mind the gap: On the value of silence representations to lexical-based speech emotion recognition}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={156--160},
  doi={10.21437/Interspeech.2022-10943}
}