AudioFormer: Channel Audio Encoder Based on Multi-granularity Features

  • Conference paper
  • First Online:
Neural Information Processing (ICONIP 2023)

Abstract

Speech emotion recognition suffers from poorly standardized feature extraction methods and from acoustic representations of insufficient depth. To address this, we first propose a multi-granularity feature extraction method that preserves the integrity of the data's features while avoiding the redundancy of existing extraction methods; second, we propose a Channel Audio Encoder that applies a different feature encoder to each granularity channel to extract high-order features. Experiments show that the proposed multi-granularity Channel Audio Encoder achieves state-of-the-art performance on the IEMOCAP dataset. We also evaluate the method on a real-world dataset to demonstrate its practical usability and to provide a reference for aiding the diagnosis of mental illness.
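The abstract names only the two components, so the sketch below is a minimal illustration rather than the authors' implementation: MFCCs extracted at several window lengths stand in for the multi-granularity features, and small convolutional encoders, one per granularity channel, stand in for the paper's feature encoders. The feature choice, window sizes, fusion by concatenation, and all dimensions are assumptions, not values from the paper.

import librosa
import torch
import torch.nn as nn

def multi_granularity_features(path, sr=16000, granularities=(400, 800, 1600)):
    # Extract MFCCs at several window lengths so that fine- and coarse-grained
    # temporal structure are both represented (an illustrative stand-in for
    # the paper's multi-granularity features).
    y, sr = librosa.load(path, sr=sr)
    feats = []
    for n_fft in granularities:
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                    n_fft=n_fft, hop_length=n_fft // 2)
        feats.append(torch.from_numpy(mfcc.T).float())  # (frames_i, 40)
    return feats

class ChannelAudioEncoder(nn.Module):
    # One encoder per granularity channel; pooled outputs are fused by
    # concatenation before classification (a hypothetical fusion scheme).
    def __init__(self, n_channels=3, d_in=40, d_model=64, n_classes=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Conv1d(d_in, d_model, kernel_size=3, padding=1),
                          nn.ReLU(),
                          nn.AdaptiveAvgPool1d(1))
            for _ in range(n_channels))
        self.classifier = nn.Linear(n_channels * d_model, n_classes)

    def forward(self, channels):
        # channels: list of (batch, frames_i, d_in) tensors, one per granularity.
        pooled = [enc(x.transpose(1, 2)).squeeze(-1)
                  for enc, x in zip(self.encoders, channels)]
        return self.classifier(torch.cat(pooled, dim=-1))

Under these assumptions a forward pass takes one tensor per granularity, e.g. model([f.unsqueeze(0) for f in multi_granularity_features("sample.wav")]), and returns logits over four classes, matching the common four-class IEMOCAP evaluation setup.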

Acknowledgements

This work is funded by the Key R&D Program of Hebei Province (No. 21373802D) and the Artificial Intelligence Collaborative Education Project of the Ministry of Education (No. 201801003011).

Author information

Corresponding author

Correspondence to Yunfeng Xu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, J., Xu, Y., Miao, B., Zhao, S. (2024). AudioFormer: Channel Audio Encoder Based on Multi-granularity Features. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1964. Springer, Singapore. https://doi.org/10.1007/978-981-99-8141-0_27

  • DOI: https://doi.org/10.1007/978-981-99-8141-0_27

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8140-3

  • Online ISBN: 978-981-99-8141-0

  • eBook Packages: Computer Science, Computer Science (R0)
