ISCA Archive Interspeech 2019

Deep Learning of Segment-Level Feature Representation with Multiple Instance Learning for Utterance-Level Speech Emotion Recognition

Shuiyang Mao, P.C. Ching, Tan Lee

In this paper, we propose to combine the deep learning of feature representation with multiple instance learning (MIL) to recognize emotion from speech. The key idea of our approach is to first classify the emotional state of each segment, and then construct the utterance-level classification as an aggregation of the segment-level decisions. For the segment-level classification, we investigate two different deep neural network (DNN) architectures, referred to as SegMLP and SegCNN. SegMLP is a multilayer perceptron (MLP) that extracts high-level feature representations from manually designed perceptual features, while SegCNN is a convolutional neural network (CNN) that automatically learns emotion-specific features from log Mel filterbanks. Extensive emotion recognition experiments are carried out on the CASIA corpus and the IEMOCAP database. We find that: (1) the aggregation of segment-level decisions provides richer information than statistics over the low-level descriptors (LLDs) computed across the whole utterance; (2) automatic feature learning outperforms manually designed features. Our experimental results are also compared with those of state-of-the-art methods, further demonstrating the effectiveness of the proposed approach.
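The abstract describes a two-stage pipeline: a segment-level classifier over log Mel filterbank patches, followed by an MIL-style aggregation of segment decisions into one utterance-level label. Below is a minimal PyTorch sketch of that idea; the segment length, number of Mel bands, label set size, network layers, and the mean-pooling aggregation rule are illustrative assumptions, not details taken from the paper.

# A minimal sketch of segment-level classification plus utterance-level
# aggregation, in the spirit of the SegCNN + MIL pipeline the abstract
# outlines. All sizes and the pooling rule below are assumptions.
import torch
import torch.nn as nn

NUM_EMOTIONS = 4   # assumed label set size (e.g., a subset of IEMOCAP)
NUM_MEL = 40       # assumed number of log Mel filterbank channels
SEG_FRAMES = 100   # assumed segment length in frames

class SegCNNSketch(nn.Module):
    """Illustrative segment-level CNN over log Mel filterbank patches."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, NUM_EMOTIONS)

    def forward(self, x):
        # x: (num_segments, 1, NUM_MEL, SEG_FRAMES) -> per-segment logits
        h = self.features(x).flatten(1)
        return self.classifier(h)

def utterance_decision(segment_logits):
    """Aggregate segment posteriors into one utterance-level label.

    Mean pooling over segment posteriors is one common MIL-style
    aggregation; the paper's exact rule may differ.
    """
    posteriors = torch.softmax(segment_logits, dim=-1)
    return posteriors.mean(dim=0).argmax().item()

# Usage: split an utterance into segments, classify each, then aggregate.
segments = torch.randn(8, 1, NUM_MEL, SEG_FRAMES)  # stand-in features
model = SegCNNSketch()
label = utterance_decision(model(segments))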


doi: 10.21437/Interspeech.2019-1968

Cite as: Mao, S., Ching, P.C., Lee, T. (2019) Deep Learning of Segment-Level Feature Representation with Multiple Instance Learning for Utterance-Level Speech Emotion Recognition. Proc. Interspeech 2019, 1686-1690, doi: 10.21437/Interspeech.2019-1968

@inproceedings{mao19_interspeech,
  author={Shuiyang Mao and P.C. Ching and Tan Lee},
  title={{Deep Learning of Segment-Level Feature Representation with Multiple Instance Learning for Utterance-Level Speech Emotion Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1686--1690},
  doi={10.21437/Interspeech.2019-1968}
}