Abstract
With the rise of machine learning and the deepening of human-computer interaction applications, speech emotion recognition has attracted increasing attention. However, because constructing a speech emotion corpus is costly, speech emotion datasets remain scarce, so achieving high recognition accuracy from a limited corpus is one of the central problems in speech emotion recognition. To address it, we fused pre-trained speech features with acoustic features to strengthen the generalization of the speech representation, and proposed a novel feature fusion model based on a Transformer and a BiLSTM. We fused the pre-trained features extracted by TERA, Audio ALBERT, and NPC with acoustic features of the voice, and conducted experiments on the CASIA Chinese speech emotion corpus. The results showed that our method reached 94% accuracy with the TERA features.
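The abstract only names the building blocks, so the following is a minimal PyTorch sketch of one way frame-level pre-trained features and utterance-level acoustic features could be fused through a Transformer encoder and a BiLSTM. All dimensions, layer counts, and the class name FusionEmotionClassifier are illustrative assumptions, not the authors' published architecture; the acoustic dimension of 384 assumes, for example, the openSMILE INTERSPEECH 2009 emotion feature set.

import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    """Illustrative fusion of pre-trained speech features (e.g. TERA)
    with utterance-level acoustic features; not the paper's exact model."""

    def __init__(self, pretrained_dim=768, acoustic_dim=384,
                 hidden_dim=256, num_classes=6):
        super().__init__()
        # Transformer encoder over the pre-trained feature sequence
        layer = nn.TransformerEncoderLayer(d_model=pretrained_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # BiLSTM summarizes the encoded sequence into a fixed-size vector
        self.bilstm = nn.LSTM(pretrained_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Classifier over the concatenation of both feature streams
        self.classifier = nn.Linear(2 * hidden_dim + acoustic_dim, num_classes)

    def forward(self, pretrained_feats, acoustic_feats):
        # pretrained_feats: (batch, frames, pretrained_dim)
        # acoustic_feats:   (batch, acoustic_dim), e.g. openSMILE functionals
        x = self.encoder(pretrained_feats)
        _, (h, _) = self.bilstm(x)
        # Concatenate the final forward and backward hidden states
        seq_repr = torch.cat([h[-2], h[-1]], dim=-1)   # (batch, 2*hidden_dim)
        fused = torch.cat([seq_repr, acoustic_feats], dim=-1)
        return self.classifier(fused)

# Example with random tensors; CASIA covers six emotion classes
model = FusionEmotionClassifier()
logits = model(torch.randn(4, 200, 768), torch.randn(4, 384))
print(logits.shape)  # torch.Size([4, 6])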
References
Ren, F.: Affective information processing and recognizing human emotion. Electron. Notes Theor. Comput. Sci. 225, 39–50 (2009)
Ren, F., Bao, Y.: A review on human-computer interaction and intelligent robots. Int. J. Inf. Technol. Decis. Mak. 19(1), 5–47 (2020)
Liu, Z., et al.: Vowel priority lip matching scheme and similarity evaluation model based on humanoid robot Ren-Xin. J. Ambient Intell. Humaniz. Comput. 1–12 (2020)
Deng, J., Ren, F.: Multi-label emotion detection via emotion-specified feature extraction and emotion correlation learning. IEEE Trans. Affect. Comput. (2020)
Huang, Z., et al.: Facial expression imitation method for humanoid robot based on smooth-constraint reversed mechanical model (SRMM). IEEE Trans. Hum. Mach. Syst. 50(6), 538–549 (2020)
Akçay, M.B., Oğuz, K.: Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 116, 56–76 (2020)
Swain, M., Routray, A., Kabisatpathy, P.: Databases, features and classifiers for speech emotion recognition: a review. Int. J. Speech Technol. 21(1), 93–120 (2018)
Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)
Byun, S.-W., Lee, S.-P.: A study on a speech emotion recognition system with effective acoustic features using deep learning algorithms. Appl. Sci. 11(4), 1890 (2021)
Ho, N.-H., Yang, H.-J., Kim, S.-H., Lee, G.: Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 8, 61672–61686 (2020)
Kwon, S.: MLT-DNet: speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 167, 114177 (2021)
Chung, Y.-A., Glass, J.: Speech2vec: a sequence-to-sequence framework for learning word embeddings from speech. In: Interspeech 2018 (2018)
Schneider, S., Baevski, A., Collobert, R., Auli, M.: wav2vec: unsupervised pre-training for speech recognition. In: Interspeech 2019 (2019)
Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: self-supervised learning of discrete speech representations. In: ICLR 2020 (2020)
Chorowski, J., Weiss, R.J., Bengio, S., van den Oord, A.: Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019)
Wang, W., Watters, P.A., Cao, X., Shen, L., Li, B.: Significance of phonological features in speech emotion recognition. Int. J. Speech Technol. 23(3), 633–642 (2020)
Zhang, S., et al.: Learning deep multimodal affective features for spontaneous speech emotion recognition. Speech Commun. 127, 73–81 (2021)
Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. (2017)
Liu, A.T., Li, S.-W., Lee, H.: TERA: self-supervised learning of transformer encoder representation for speech. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2351–2366 (2021)
Chi, P.-H., et al.: Audio ALBERT: a lite BERT for self-supervised learning of audio representation. In: 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE (2021)
Liu, A.H., Chung, Y.-A., Glass, J.: Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406 (2020)
Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia (2010)
Acknowledgments
This research has been supported by JSPS KAKENHI Grant Number 19K20345 and Grant Number 19H04215.
Copyright information
© 2022 IFIP International Federation for Information Processing
Cite this paper
Liu, Z., Kang, X., Ren, F. (2022). Improving Speech Emotion Recognition by Fusing Pre-trained and Acoustic Features Using Transformer and BiLSTM. In: Shi, Z., Zucker, JD., An, B. (eds) Intelligent Information Processing XI. IIP 2022. IFIP Advances in Information and Communication Technology, vol 643. Springer, Cham. https://doi.org/10.1007/978-3-031-03948-5_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-03947-8
Online ISBN: 978-3-031-03948-5
eBook Packages: Computer Science, Computer Science (R0)