ABSTRACT
Speech emotion recognition (SER) has become an attractive topic owing to its broad range of applications. Segmentation is often used to enlarge the training data for SER, but each segment simply inherits the label of its whole utterance, and these noisy inherited labels can degrade performance. In this paper, we propose a robust noise-label-suppressed module that relabels segments to suppress the harmful effects of inherited labels. First, the log-Mel spectrogram of each speech segment is computed together with its deltas and delta-deltas. Then, speech features are extracted from this 3-D input by a feature extraction model. Finally, the label of each segment is corrected by the relabel model. Experimental results on the IEMOCAP dataset show that the proposed noise-label-suppressed module outperforms other advanced methods and achieves robust performance.
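The relabeling step described above can be sketched as a simple confidence-margin rule: a segment keeps its inherited utterance-level label unless the model assigns some other class a probability that exceeds the inherited class's probability by a margin. This is a minimal illustrative sketch, not the paper's exact relabel model; the function name, the margin threshold, and the decision rule are assumptions for illustration.

```python
import numpy as np

def relabel_segments(probs, inherited, margin=0.2):
    """Hypothetical segment-relabeling rule (illustrative sketch).

    probs:     (n_segments, n_classes) softmax outputs of the feature
               extraction model for each segment.
    inherited: (n_segments,) integer labels each segment inherited
               from its utterance.
    margin:    assumed confidence gap required to override the
               inherited label.
    """
    probs = np.asarray(probs)
    inherited = np.asarray(inherited)
    # Model's most confident class per segment.
    top = probs.argmax(axis=1)
    # Gap between the top class and the inherited class.
    gap = probs.max(axis=1) - probs[np.arange(len(probs)), inherited]
    # Relabel only when the model clearly disagrees with the inherited label.
    return np.where(gap > margin, top, inherited)

# Example: segment 0 is confidently class 0 despite inheriting label 1,
# so it is relabeled; the others keep their inherited labels.
probs = [[0.7, 0.2, 0.1],
         [0.4, 0.35, 0.25],
         [0.1, 0.8, 0.1]]
print(relabel_segments(probs, [1, 0, 1]))  # → [0 0 1]
```

A margin-based override like this trades off noise suppression against trust in the model: a larger margin keeps more inherited labels, a smaller one relabels more aggressively.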