
Improving Speech Emotion Recognition by Fusing Pre-trained and Acoustic Features Using Transformer and BiLSTM

  • Conference paper
Intelligent Information Processing XI (IIP 2022)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 643)


Abstract

With the rise of machine learning and the growing range of human-computer interaction applications, speech emotion recognition has attracted increasing attention. However, because constructing speech emotion corpora is expensive, such datasets remain scarce, and achieving high recognition accuracy with limited corpora is therefore one of the central problems in speech emotion recognition. To address this problem, we fused pre-trained speech features with acoustic features to improve the generalization of the speech representation and proposed a novel feature fusion model based on a Transformer and a BiLSTM. We fused the pre-trained features extracted by Tera, Audio Albert, and NPC with the acoustic features of the voice and conducted experiments on the CASIA Chinese speech emotion corpus. The results showed that our method achieved 94% accuracy with the Tera pre-trained features.
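
As a rough illustration of the fusion approach the abstract describes, the sketch below (not the authors' released code) concatenates frame-level pre-trained features with acoustic features and passes them through a Transformer encoder layer and a BiLSTM before pooling and classification. The feature dimensions (768 for the pre-trained features, 39 for the acoustic features), the layer sizes, and the six emotion classes are illustrative assumptions; the exact architecture and fusion details in the paper may differ.

# Minimal sketch of a Transformer + BiLSTM fusion classifier (assumed shapes).
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, pretrained_dim=768, acoustic_dim=39,
                 d_model=256, num_classes=6):
        super().__init__()
        # Project the concatenated features to a common model dimension.
        self.proj = nn.Linear(pretrained_dim + acoustic_dim, d_model)
        # One Transformer encoder layer models global context via self-attention.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)
        # A BiLSTM captures sequential dynamics in both directions.
        self.bilstm = nn.LSTM(d_model, d_model // 2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, pretrained_feats, acoustic_feats):
        # pretrained_feats: (batch, time, pretrained_dim), e.g. Tera frame features
        # acoustic_feats:   (batch, time, acoustic_dim), e.g. frame-level acoustic descriptors
        x = torch.cat([pretrained_feats, acoustic_feats], dim=-1)
        x = self.proj(x)
        x = self.transformer(x)
        x, _ = self.bilstm(x)
        # Mean-pool over time, then classify into emotion categories.
        return self.classifier(x.mean(dim=1))

# Example with random tensors standing in for real features:
model = FusionEmotionClassifier()
logits = model(torch.randn(2, 100, 768), torch.randn(2, 100, 39))
print(logits.shape)  # torch.Size([2, 6])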


References

  • Ren, F.: Affective information processing and recognizing human emotion. Electron. Notes Theor. Comput. Sci. 225, 39–50 (2009)


  • Ren, F., Bao, Y.: A review on human-computer interaction and intelligent robots. Int. J. Inf. Technol. Decis. Mak. 19(1), 5–47 (2020)


  • Liu, Z., et al.: Vowel priority lip matching scheme and similarity evaluation model based on humanoid robot Ren-Xin. J. Ambient Intell. Humaniz. Comput. 1–12 (2020)


  • Deng, J., Ren, F.: Multi-label emotion detection via emotion-specified feature extraction and emotion correlation learning. IEEE Trans. Affect. Comput. (2020)


  • Huang, Z., et al.: Facial expression imitation method for humanoid robot based on smooth-constraint reversed mechanical model (SRMM). IEEE Trans. Hum. Mach. Syst. 50(6), 538–549 (2020)


  • Akçay, M.B., Oğuz, K.: Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 116, 56–76 (2020)


  • Swain, M., Routray, A., Kabisatpathy, P.: Databases, features and classifiers for speech emotion recognition: a review. Int. J. Speech Technol. 21(1), 93–120 (2018)


  • Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)


  • Byun, S.-W., Lee, S.-P.: A study on a speech emotion recognition system with effective acoustic features using deep learning algorithms. Appl. Sci. 11(4), 1890 (2021)


  • Ho, N.-H., Yang, H.-J., Kim, S.-H., Lee, G.: Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 8, 61672–61686 (2020)


  • Kwon, S.: MLT-DNet: speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 167, 114177 (2021)



  • Chung, Y.-A., Glass, J.: Speech2vec: a sequence-to-sequence framework for learning word embeddings from speech. Interspeech 2018 (2018)


  • Schneider, S., Baevski, A., Collobert, R., Auli, M.: wav2vec: unsupervised pre-training for speech recognition. Interspeech (2019)


  • Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: self-supervised learning of discrete speech representations. In: ICLR (2020)


  • Chorowski, J., Weiss, R.J., Bengio, S., van den Oord, A.: Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2041–2053 (2019)


  • Wang, W., Watters, P.A., Cao, X., Shen, L., Li, B.: Significance of phonological features in speech emotion recognition. Int. J. Speech Technol. 23(3), 633–642 (2020)


  • Zhang, S., et al.: Learning deep multimodal affective features for spontaneous speech emotion recognition. Speech Commun. 127, 73–81 (2021)


  • Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. (2017)


  • Liu, A.T., Li, S.-W., Lee, H.: Tera: self-supervised learning of transformer encoder representation for speech. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2351–2366 (2021)


  • Chi, P.-H., et al.: Audio albert: a lite bert for self-supervised learning of audio representation. In: 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE (2021)


  • Liu, A.H., Chung, Y.-A., Glass, J.: Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406 (2020)


  • Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia (2010)



Acknowledgments

This research was supported by JSPS KAKENHI Grant Numbers 19K20345 and 19H04215.

Author information


Corresponding author

Correspondence to Fuji Ren.



Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Liu, Z., Kang, X., Ren, F. (2022). Improving Speech Emotion Recognition by Fusing Pre-trained and Acoustic Features Using Transformer and BiLSTM. In: Shi, Z., Zucker, JD., An, B. (eds) Intelligent Information Processing XI. IIP 2022. IFIP Advances in Information and Communication Technology, vol 643. Springer, Cham. https://doi.org/10.1007/978-3-031-03948-5_28


  • DOI: https://doi.org/10.1007/978-3-031-03948-5_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-03947-8

  • Online ISBN: 978-3-031-03948-5

  • eBook Packages: Computer Science, Computer Science (R0)
