An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks

Lin, Wei-Cheng; Busso, Carlos

doi:10.21437/Interspeech.2020-2636

Speech emotion recognition (SER) plays an important role in multiple fields such as healthcare, human-computer interaction (HCI), and security and defense. Emotional labels are often annotated at the sentence level (i.e., one label per sentence), resulting in a sequence-to-one recognition problem. Traditionally, studies have relied on statistical descriptions, which are computed over time from low-level descriptors (LLDs), creating a fixed-dimension sentence-level feature representation regardless of the duration of the sentence. However, sentence-level features lack temporal information, which limits the performance of SER systems. Recently, new deep learning architectures have been proposed to model temporal data. An important question is how to extract emotion-relevant features with temporal information. This study proposes a novel data processing approach that extracts a fixed number of small chunks from sentences of different durations by changing the overlap between these chunks. The approach is flexible, providing an ideal framework to combine gated networks or attention mechanisms with long short-term memory (LSTM) networks. Our experimental results based on the MSP-Podcast dataset demonstrate that the proposed method not only significantly improves recognition accuracy over alternative temporal-based models relying on LSTM, but also improves computational efficiency.
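The chunking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_into_chunks`, the parameters `num_chunks` and `chunk_len`, the evenly spaced start positions, and the zero-padding of sentences shorter than one chunk are all assumptions made for illustration. The sketch only shows how holding the number and length of chunks fixed forces the overlap between neighbouring chunks to vary with sentence duration.

```python
import numpy as np

def split_into_chunks(features, num_chunks, chunk_len):
    """Map a (T, D) frame-level feature matrix to (num_chunks, chunk_len, D).

    Hypothetical helper: the spacing rule and padding are assumptions,
    not details taken from the paper.
    """
    features = np.asarray(features)
    num_frames = features.shape[0]

    # Assumption: sentences shorter than one chunk are zero-padded at the end.
    if num_frames < chunk_len:
        pad = chunk_len - num_frames
        features = np.pad(features, ((0, pad), (0, 0)))
        num_frames = chunk_len

    # Evenly spaced start frames covering the sentence from first to last frame.
    # A shorter sentence gives a smaller step, hence more overlap between chunks.
    starts = np.linspace(0, num_frames - chunk_len, num_chunks)
    starts = np.round(starts).astype(int)

    return np.stack([features[s:s + chunk_len] for s in starts])


# Two sentences of different durations yield tensors of identical shape.
short_sentence = np.random.randn(60, 3)   # 60 frames, 3 features per frame
long_sentence = np.random.randn(400, 3)   # 400 frames, 3 features per frame
print(split_into_chunks(short_sentence, num_chunks=5, chunk_len=20).shape)
print(split_into_chunks(long_sentence, num_chunks=5, chunk_len=20).shape)
```

Because every sentence maps to the same tensor shape, chunks from many sentences can be batched together and passed to a chunk-level encoder such as an LSTM, with a gated or attention layer combining the per-chunk outputs into one sentence-level prediction.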

Cite as: Lin, W.-C., Busso, C. (2020) An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks. Proc. Interspeech 2020, 2322-2326, doi: 10.21437/Interspeech.2020-2636

@inproceedings{lin20d_interspeech,
  author={Wei-Cheng Lin and Carlos Busso},
  title={{An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2322--2326},
  doi={10.21437/Interspeech.2020-2636}
}