
Improving Speech Emotion Recognition by Fusing Pre-trained and Acoustic Features Using Transformer and BiLSTM

  • Conference paper
Intelligent Information Processing XI (IIP 2022)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 643)


Abstract

With the rise of machine learning and the growing range of human-computer interaction applications, speech emotion recognition has attracted increasing attention. However, because constructing speech emotion corpora is expensive, such datasets remain scarce, and achieving high recognition accuracy with limited corpora is therefore one of the central problems in speech emotion recognition. To address this problem, we fused pre-trained speech features with acoustic features to improve the generalization of the speech representation and proposed a novel feature fusion model based on a Transformer and a BiLSTM. We fused the pre-trained features extracted by Tera, Audio Albert, and NPC with the acoustic features of the voice and conducted experiments on the CASIA Chinese speech emotion corpus. The results showed that our method achieved 94% accuracy with the Tera pre-trained features.
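
As a rough illustration of the fusion approach the abstract describes, the sketch below (not the authors' released code) concatenates frame-level pre-trained features with acoustic features and passes them through a Transformer encoder layer and a BiLSTM before pooling and classification. The feature dimensions (768 for the pre-trained features, 39 for the acoustic features), the layer sizes, and the six emotion classes are illustrative assumptions; the exact architecture and fusion details in the paper may differ.

# Minimal sketch of a Transformer + BiLSTM fusion classifier (assumed shapes).
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, pretrained_dim=768, acoustic_dim=39,
                 d_model=256, num_classes=6):
        super().__init__()
        # Project the concatenated features to a common model dimension.
        self.proj = nn.Linear(pretrained_dim + acoustic_dim, d_model)
        # One Transformer encoder layer models global context via self-attention.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)
        # A BiLSTM captures sequential dynamics in both directions.
        self.bilstm = nn.LSTM(d_model, d_model // 2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, pretrained_feats, acoustic_feats):
        # pretrained_feats: (batch, time, pretrained_dim), e.g. Tera frame features
        # acoustic_feats:   (batch, time, acoustic_dim), e.g. frame-level acoustic descriptors
        x = torch.cat([pretrained_feats, acoustic_feats], dim=-1)
        x = self.proj(x)
        x = self.transformer(x)
        x, _ = self.bilstm(x)
        # Mean-pool over time, then classify into emotion categories.
        return self.classifier(x.mean(dim=1))

# Example with random tensors standing in for real features:
model = FusionEmotionClassifier()
logits = model(torch.randn(2, 100, 768), torch.randn(2, 100, 39))
print(logits.shape)  # torch.Size([2, 6])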


References

  • Ren, F.: Affective information processing and recognizing human emotion. Electron. Notes Theor. Comput. Sci. 225, 39–50 (2009)


  • Ren, F., Bao, Y.: A review on human-computer interaction and intelligent robots. Int. J. Inf. Technol. Decis. Mak. 19(1), 5–47 (2020)


  • Liu, Z., et al.: Vowel priority lip matching scheme and similarity evaluation model based on humanoid robot Ren-Xin. J. Ambient Intell. Humaniz. Comput. 1–12 (2020)


  • Deng, J., Ren, F.: Multi-label emotion detection via emotion-specified feature extraction and emotion correlation learning. IEEE Trans. Affect. Comput. (2020)


  • Huang, Z., et al.: Facial expression imitation method for humanoid robot based on smooth-constraint reversed mechanical model (SRMM). IEEE Trans. Hum. Mach. Syst. 50(6), 538–549 (2020)


  • Akçay, M.B., Oğuz, K.: Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 116, 56–76 (2020)


  • Swain, M., Routray, A., Kabisatpathy, P.: Databases, features and classifiers for speech emotion recognition: a review. Int. J. Speech Technol. 21(1), 93–120 (2018)


  • Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)


  • Byun, S.-W., Lee, S.-P.: A study on a speech emotion recognition system with effective acoustic features using deep learning algorithms. Appl. Sci. 11(4), 1890 (2021)


  • Ho, N.-H., Yang, H.-J., Kim, S.-H., Lee, G.: Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 8, 61672–61686 (2020)


  • Kwon, S.: MLT-DNet: speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 167, 114177 (2021)



  • Chung, Y.-A., Glass, J.: Speech2vec: a sequence-to-sequence framework for learning word embeddings from speech. Interspeech 2018 (2018)


  • Schneider, S., Baevski, A., Collobert, R., Auli, M.: wav2vec: unsupervised pre-training for speech recognition. Interspeech (2019)


  • Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: self-supervised learning of discrete speech representations. In: ICLR (2020)


  • Chorowski, J., Weiss, R.J., Bengio, S., van den Oord, A.: Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019)


  • Wang, W., Watters, P.A., Cao, X., Shen, L., Li, B.: Significance of phonological features in speech emotion recognition. Int. J. Speech Technol. 23(3), 633–642 (2020)


  • Zhang, S., et al.: Learning deep multimodal affective features for spontaneous speech emotion recognition. Speech Commun. 127, 73–81 (2021)


  • Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. (2017)


  • Liu, A.T., Li, S.-W., Lee, H.: Tera: self-supervised learning of transformer encoder representation for speech. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2351–2366 (2021)


  • Chi, P.-H., et al.: Audio albert: a lite bert for self-supervised learning of audio representation. In: 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE (2021)


  • Liu, A.H., Chung, Y.-A., Glass, J.: Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406 (2020)


  • Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia (2010)



Acknowledgments

This research was supported by JSPS KAKENHI Grant Numbers 19K20345 and 19H04215.

Author information


Corresponding author

Correspondence to Fuji Ren.



Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Liu, Z., Kang, X., Ren, F. (2022). Improving Speech Emotion Recognition by Fusing Pre-trained and Acoustic Features Using Transformer and BiLSTM. In: Shi, Z., Zucker, JD., An, B. (eds) Intelligent Information Processing XI. IIP 2022. IFIP Advances in Information and Communication Technology, vol 643. Springer, Cham. https://doi.org/10.1007/978-3-031-03948-5_28


  • DOI: https://doi.org/10.1007/978-3-031-03948-5_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-03947-8

  • Online ISBN: 978-3-031-03948-5

  • eBook Packages: Computer Science, Computer Science (R0)
