
Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition

  • Conference paper
  • In: MultiMedia Modeling (MMM 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13834)


Abstract

Dynamic-static fusion features play an important role in speech emotion recognition (SER). However, dynamic and static features are generally fused by simple addition or serial fusion, which may lose part of the underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) that uses a cross attentional feature fusion mechanism (Cross AFF) to extract superior deep dynamic-static fusion features. Specifically, the Cross AFF fuses, in parallel, the deep features produced by a CNN/LSTM feature extraction module, which extracts deep static and deep dynamic features from acoustic features (MFCC, delta, and delta-delta). In addition to the SD-CAFF framework, we employ multi-task learning during training to further improve the accuracy of emotion recognition. Experimental results on IEMOCAP show that SD-CAFF achieves a weighted accuracy (WA) of 75.78% and an unweighted accuracy (UA) of 74.89%, outperforming current state-of-the-art methods. Furthermore, SD-CAFF achieves competitive cross-corpus performance on MSP-IMPROV (WA: 56.77%; UA: 56.30%).

Supported by the National Natural Science Foundation (NNSF) of China (Grant 61867005).
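
To make the fusion idea concrete, the following is a minimal sketch of what a CNN/LSTM dual-branch extractor with cross-attentional fusion and a multi-task head could look like, written in PyTorch. All module names, dimensions, the auxiliary task, and the 0.3 loss weight are illustrative assumptions; this is not the paper's SD-CAFF implementation, whose exact Cross AFF design is not given in this excerpt.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse static and dynamic feature sequences by letting each attend to the other."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.static_to_dynamic = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dynamic_to_static = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, static_feat, dynamic_feat):
        # static_feat, dynamic_feat: (batch, frames, dim)
        s, _ = self.static_to_dynamic(static_feat, dynamic_feat, dynamic_feat)
        d, _ = self.dynamic_to_static(dynamic_feat, static_feat, static_feat)
        # Pool over time, concatenate both attended views, and project back to dim.
        return self.proj(torch.cat([s.mean(dim=1), d.mean(dim=1)], dim=-1))

class SERSketch(nn.Module):
    """CNN branch for static (MFCC) features, LSTM branch for dynamic (delta,
    delta-delta) features, cross-attentional fusion, and two output heads
    (emotion + auxiliary task) for multi-task training. Hypothetical layout."""
    def __init__(self, n_mfcc=40, dim=128, n_emotions=4, n_aux=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=2 * n_mfcc, hidden_size=dim, batch_first=True)
        self.fusion = CrossAttentionFusion(dim)
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.aux_head = nn.Linear(dim, n_aux)  # e.g. gender, as an assumed auxiliary task

    def forward(self, mfcc, deltas):
        # mfcc: (batch, frames, n_mfcc); deltas: (batch, frames, 2 * n_mfcc)
        static = self.cnn(mfcc.transpose(1, 2)).transpose(1, 2)  # (batch, frames, dim)
        dynamic, _ = self.lstm(deltas)                           # (batch, frames, dim)
        fused = self.fusion(static, dynamic)                     # (batch, dim)
        return self.emotion_head(fused), self.aux_head(fused)

# Multi-task training step: weighted sum of the emotion loss and the auxiliary loss.
model = SERSketch()
mfcc = torch.randn(8, 300, 40)    # 8 utterances, 300 frames, 40 MFCC coefficients
deltas = torch.randn(8, 300, 80)  # delta and delta-delta stacked along the feature axis
emotion_logits, aux_logits = model(mfcc, deltas)
loss = nn.functional.cross_entropy(emotion_logits, torch.randint(0, 4, (8,))) \
     + 0.3 * nn.functional.cross_entropy(aux_logits, torch.randint(0, 2, (8,)))
loss.backward()

In this sketch each branch attends to the other (cross attention) and the pooled outputs are concatenated and projected, loosely following the attentional feature fusion idea; SD-CAFF's actual Cross AFF mechanism may differ.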



Author information

Corresponding author: Ke Dong.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dong, K., Peng, H., Che, J. (2023). Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_29


  • DOI: https://doi.org/10.1007/978-3-031-27818-1_29


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27817-4

  • Online ISBN: 978-3-031-27818-1

