Abstract
To address the lack of standardized feature extraction methods for speech emotion recognition and the limited capacity of existing approaches to represent deep acoustic features, we first propose a multi-granularity feature extraction method that preserves the integrity of data features while avoiding the redundancy of existing extraction methods; second, we propose a Channel Audio Encoder model that applies distinct feature encoders to extract high-order features. Experiments show that the proposed multi-granularity-feature-based Channel Audio Encoder achieves state-of-the-art performance on the IEMOCAP dataset. We also evaluate the method on a real-world dataset to demonstrate its practicality and to provide a reference for aiding the diagnosis of mental illness.
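The abstract does not detail the paper's feature extraction pipeline, but the general idea of multi-granularity acoustic features can be illustrated with a minimal sketch: pool frame-level features (e.g. MFCCs) over windows of several sizes, yielding one feature stream per granularity. The function name, window sizes, and mean-pooling choice below are illustrative assumptions, not the authors' method.

```python
import numpy as np

def multi_granularity_features(frames, window_sizes=(1, 4, 16)):
    """Pool frame-level acoustic features at several granularities.

    frames: (T, D) array of per-frame features (e.g. MFCCs).
    Returns a list of (T // w, D) arrays, one per window size w,
    where each row is the mean over a non-overlapping window.
    """
    outputs = []
    for w in window_sizes:
        t = frames.shape[0] - frames.shape[0] % w  # drop ragged tail
        pooled = frames[:t].reshape(t // w, w, -1).mean(axis=1)
        outputs.append(pooled)
    return outputs

# Toy example: 32 frames of 13-dimensional features.
feats = np.random.randn(32, 13)
fine, mid, coarse = multi_granularity_features(feats)
print(fine.shape, mid.shape, coarse.shape)  # (32, 13) (8, 13) (2, 13)
```

Each granularity could then be fed to its own encoder, in the spirit of the channel-wise design the abstract describes.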
Acknowledgements
This work is funded by the Key R&D Program of Hebei Province (No. 21373802D) and the Artificial Intelligence Collaborative Education Project of the Ministry of Education (201801003011).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, J., Xu, Y., Miao, B., Zhao, S. (2024). AudioFormer: Channel Audio Encoder Based on Multi-granularity Features. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1964. Springer, Singapore. https://doi.org/10.1007/978-981-99-8141-0_27
DOI: https://doi.org/10.1007/978-981-99-8141-0_27
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8140-3
Online ISBN: 978-981-99-8141-0
eBook Packages: Computer Science (R0)