Audio-visual emotion recognition using multi-directional regression and Ridgelet transform

Original Paper
Journal on Multimodal User Interfaces

Abstract

In this paper, we propose an audio-visual emotion recognition system that combines multi-directional regression (MDR) audio features with ridgelet-transform-based face image features. MDR features capture directional derivative information in the spectro-temporal domain of speech and are therefore well suited to encoding different degrees of rising or falling pitch and formant frequencies. For video input, interest points in each time frame are detected using spatio-temporal filters, and the ridgelet transform is applied to cuboids around these interest points. Two separate extreme learning machine (ELM) classifiers are used, one for the speech modality and one for the face modality, and their scores are fused with a Bayesian sum rule to make the final decision. Experimental results on the eNTERFACE database show that the proposed method achieves an accuracy of 85.06% using bimodal input, 64.04% using speech only, and 58.38% using face only; these accuracies exceed those reported by several other state-of-the-art systems on the same database.
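The pipeline above is easiest to see in miniature. The Python sketch below is not the authors' implementation: the directional-derivative pooling is only a stand-in for the paper's full MDR feature computation, the ELM classifiers are replaced by pre-computed per-class posterior vectors, and all function names, the angle set, and the toy spectrogram are illustrative assumptions. What it does show faithfully is (a) how a derivative along an arbitrary spectro-temporal direction can be formed from the time and frequency gradients, and (b) the Bayesian sum rule used to fuse the two modality scores.

```python
import numpy as np

def directional_derivative(spec, theta):
    """Derivative of a spectro-temporal representation `spec`
    (frequency x time) along angle `theta` (radians), formed as a
    linear combination of the frequency and time gradients:
        D_theta = cos(theta) * dS/dt + sin(theta) * dS/df
    """
    d_f, d_t = np.gradient(spec)  # axis 0 = frequency, axis 1 = time
    return np.cos(theta) * d_t + np.sin(theta) * d_f

def mdr_like_features(spec, angles=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Pool simple statistics of the derivative along each direction into
    one fixed-length vector (a hypothetical stand-in for MDR features)."""
    stats = []
    for theta in angles:
        d = directional_derivative(spec, theta)
        stats.extend([d.mean(), d.std()])
    return np.asarray(stats)

def bayesian_sum_rule(p_audio, p_face, priors=None):
    """Late fusion of two per-class posterior vectors with the Bayesian
    sum rule: choose the class maximizing
        (1 - R) * P(c) + sum over modalities of P(c | x_m),
    here with R = 2 modalities (speech and face)."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_face = np.asarray(p_face, dtype=float)
    if priors is None:
        # Assume equiprobable emotion classes when priors are unknown.
        priors = np.full_like(p_audio, 1.0 / p_audio.size)
    scores = (1 - 2) * priors + p_audio + p_face
    return int(np.argmax(scores)), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.standard_normal((64, 100))  # toy 64-band, 100-frame spectrogram
    print("MDR-like feature vector:", mdr_like_features(spec))

    # Hypothetical posteriors over the six eNTERFACE emotions from the
    # audio ELM and the face ELM.
    p_audio = np.array([0.10, 0.05, 0.40, 0.15, 0.20, 0.10])
    p_face = np.array([0.05, 0.10, 0.30, 0.35, 0.10, 0.10])
    label, fused = bayesian_sum_rule(p_audio, p_face)
    print("Fused decision:", label, fused)
```

Note that with equal class priors the prior term is a constant shift, so the sum rule reduces to picking the class with the largest summed posterior; unequal priors would tilt the fused decision toward more frequent emotions.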



Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at King Saud University, Riyadh, Saudi Arabia for funding this work through the research group Project No. RGP-1436-023.

Author information

Corresponding author

Correspondence to Ghulam Muhammad.


About this article


Cite this article

Hossain, M.S., Muhammad, G. Audio-visual emotion recognition using multi-directional regression and Ridgelet transform. J Multimodal User Interfaces 10, 325–333 (2016). https://doi.org/10.1007/s12193-015-0207-2

