Abstract
Speech emotion recognition is challenging because emotion categories are not crisply defined and effective feature representations are hard to obtain. Accurate feature representation is one of the key factors in successful speech emotion recognition. Studies have shown that 3D input composed of the static log-Mel spectrogram together with its deltas and delta-deltas is effective at filtering out irrelevant features. The difficulty of speech emotion recognition is also reflected in the need for fine-grained classification: typical affective-computing applications, such as psychological counseling and emotion regulation, require fine-grained emotion recognition. Motivated by these two observations, this paper proposes an end-to-end hierarchical multi-task learning framework that proceeds from coarse to fine categories to achieve fine-grained emotion recognition. Using the 3D data as input, the first stage is trained to predict the coarse emotion type, and its result then assists the second-stage training for the fine emotion type. Comparative experiments on the IEMOCAP corpus show that the coarse-to-fine classification strategy yields a significant performance improvement over the baseline models.
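As a rough illustration of the 3D input described above, the following pure-Python sketch stacks static log-Mel features with their deltas and delta-deltas. The function names are hypothetical, and the delta computation uses the standard regression formula with a window of n = 2; the paper's exact feature pipeline may differ.

```python
def deltas(frames, n=2):
    """Delta coefficients of a frame sequence via the standard
    regression formula, with edge frames clamped (replicated)."""
    T = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(i * i for i in range(1, n + 1))
    out = []
    for t in range(T):
        vec = []
        for d in range(dim):
            acc = 0.0
            for i in range(1, n + 1):
                prev = frames[max(t - i, 0)][d]
                nxt = frames[min(t + i, T - 1)][d]
                acc += i * (nxt - prev)
            vec.append(acc / denom)
        out.append(vec)
    return out

def stack_3d(log_mel):
    """Stack static, delta, and delta-delta features into 3 channels,
    mirroring the 3D log-Mel input described in the abstract."""
    d1 = deltas(log_mel)       # first-order dynamics
    d2 = deltas(d1)            # second-order dynamics
    return [log_mel, d1, d2]   # channels: static, delta, delta-delta
```

For a frame sequence whose energies rise linearly, the delta channel is constant (equal to the slope) away from the clamped edges, which is the intended "dynamics" interpretation.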
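The coarse-to-fine idea of the second stage can be sketched in miniature as follows. The grouping of IEMOCAP-style labels into coarse classes and all names below are illustrative assumptions, not the paper's exact configuration: the first-stage coarse prediction restricts which fine labels the second stage may choose.

```python
# Hypothetical mapping from fine labels to coarse classes
# (an illustrative grouping, not the paper's exact scheme).
HIERARCHY = {
    "happy": "positive", "excited": "positive",
    "angry": "negative", "sad": "negative", "frustrated": "negative",
    "neutral": "neutral",
}

def refine(coarse_pred, fine_scores, hierarchy=HIERARCHY):
    """Pick the highest-scoring fine label that is consistent
    with the first-stage coarse prediction."""
    candidates = {label: score for label, score in fine_scores.items()
                  if hierarchy[label] == coarse_pred}
    return max(candidates, key=candidates.get)
```

For example, with scores `{"happy": 0.3, "excited": 0.2, "angry": 0.5}`, a coarse prediction of `"positive"` yields `"happy"` even though `"angry"` scores highest overall, which is the pruning effect the hierarchy provides.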
Acknowledgements
This work is sponsored by the National Natural Science Foundation of P. R. China (No. 61572260), the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. 46035CX17789), and the MOE (Ministry of Education in China) Liberal Arts and Social Sciences Foundation (No. 17YJAZH071).
Cite this article
Huijuan, Z., Ning, Y. & Ruchuan, W. Coarse-to-Fine Speech Emotion Recognition Based on Multi-Task Learning. J Sign Process Syst 93, 299–308 (2021). https://doi.org/10.1007/s11265-020-01538-x