Abstract
Speech emotion recognition (SER) plays a crucial role in understanding user intent and improving human-computer interaction (HCI). The most widely used and effective methods today are based on deep learning, and temporal information has become increasingly important in SER research. Although advanced deep learning methods such as convolutional neural networks (CNNs) and attention modules can achieve good results, they often ignore the temporal information in speech, which leads to insufficient representations and low classification accuracy. To make full use of temporal features, we propose channel-aware multi-scale temporal convolutional networks (CM-TCN). First, channel-aware temporal convolutional networks (CATCN) serve as the basic structure, extracting multi-scale temporal features combined with channel information. Then, global feature attention (GFA) captures global information at different time scales and enhances the important information. Finally, an adaptive fusion module (AFM) establishes dependencies across the different network layers and fuses their features. We conduct extensive experiments on six datasets, and the results demonstrate the superior performance of CM-TCN.
This research was funded by the Scientific and Technological Innovation 2030 Major Project under Grant 2022ZD0115800 and the Xinjiang Uygur Autonomous Region Tianshan Excellence Project under Grant 2022TSYCLJ0036.
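To make the described pipeline concrete, below is a minimal PyTorch sketch of the three components named in the abstract. The abstract does not give the paper's exact layer definitions, so every name here (CATCNBlock, GlobalFeatureAttention, CMTCNSketch), the choice of dilation rates, the squeeze-and-excitation-style channel gate, and the learned per-layer fusion weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a channel-aware multi-scale TCN for SER.
# All module names, shapes, and hyperparameters are assumptions for
# illustration; they are not taken from the paper itself.
import torch
import torch.nn as nn


class CATCNBlock(nn.Module):
    """Parallel dilated 1-D convolutions (multi-scale temporal features),
    re-weighted per channel in the spirit of squeeze-and-excitation."""

    def __init__(self, channels: int, kernel_size: int = 3,
                 dilations=(1, 2, 4), reduction: int = 4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        ])
        # Channel gate: global average pool -> bottleneck MLP -> sigmoid.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.act = nn.ReLU()

    def forward(self, x):                     # x: (batch, channels, time)
        y = sum(b(x) for b in self.branches)  # fuse the temporal scales
        y = y * self.channel_gate(y)          # channel-aware re-weighting
        return self.act(y + x)                # residual connection


class GlobalFeatureAttention(nn.Module):
    """Attend over time steps to pool a block's output into one vector."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv1d(channels, 1, 1)

    def forward(self, x):                          # (batch, channels, time)
        w = torch.softmax(self.score(x), dim=-1)   # (batch, 1, time)
        return (x * w).sum(dim=-1)                 # (batch, channels)


class CMTCNSketch(nn.Module):
    """Stack of CATCN blocks whose attended outputs are fused with learned
    per-layer weights (a stand-in for the adaptive fusion module, AFM)."""

    def __init__(self, in_dim=40, channels=64, num_blocks=3, num_classes=7):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, 1)
        self.blocks = nn.ModuleList(
            [CATCNBlock(channels) for _ in range(num_blocks)])
        self.attn = nn.ModuleList(
            [GlobalFeatureAttention(channels) for _ in range(num_blocks)])
        self.layer_weights = nn.Parameter(torch.zeros(num_blocks))
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, feats):                      # feats: (batch, in_dim, time)
        x = self.proj(feats)
        pooled = []
        for block, attn in zip(self.blocks, self.attn):
            x = block(x)
            pooled.append(attn(x))                 # one vector per layer
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * pi for wi, pi in zip(w, pooled))
        return self.classifier(fused)


# Example: 8 utterances, 40 spectral coefficients per frame, 300 frames.
logits = CMTCNSketch()(torch.randn(8, 40, 300))
print(logits.shape)  # torch.Size([8, 7])
```

The per-layer softmax weights make the fusion "adaptive" in a minimal sense: the network learns how much each depth's representation should contribute to the final utterance-level embedding.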