Abstract
Multimodal emotion recognition is a challenging research topic that has recently begun to attract the attention of the research community. Recognizing a video user's emotion calls for methods that combine the audio and visual modalities, and the performance of such methods depends heavily on finding a good shared feature representation. A good shared representation must satisfy two requirements: (1) it preserves the characteristics of each modality, and (2) it balances the influence of the different modalities so that the final decision is optimal. In light of this, we propose a novel Enhanced Sparse Local Discriminative Canonical Correlation Analysis (En-SLDCCA) approach to learn the multimodal shared feature representation. Learning proceeds in two stages. In the first stage, we pretrain a Sparse Auto-Encoder on each unimodal stream, video or audio, obtaining hidden feature representations of the two modalities separately. In the second stage, we compute the correlation coefficients between the video and audio features with our En-SLDCCA approach and use them to fuse the two feature sets into the shared representation. We evaluate our method on the challenging multimodal eNTERFACE'05 database. Experimental results show that our method outperforms the unimodal video (or audio) baselines and significantly improves multimodal emotion recognition performance compared with the current state of the art.
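As a rough illustration of the second-stage fusion idea, the sketch below projects per-modality hidden features into a correlated subspace with plain canonical correlation analysis and concatenates the projections. This is not the authors' En-SLDCCA (which adds sparsity and local discriminative terms); the function names, the regularization constant `reg`, and the dimensions are our own assumptions.

```python
import numpy as np

def cca_weights(X, Y, reg=1e-3):
    """Classical CCA: find projection directions wx, wy that maximize
    the correlation between X @ wx and Y @ wy.

    X: (n_samples, dx) features of one modality (e.g. video hidden units)
    Y: (n_samples, dy) features of the other modality (e.g. audio hidden units)
    reg: small ridge term keeping the covariance matrices invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Eigenproblem Cxx^{-1} Cxy Cyy^{-1} Cyx wx = rho^2 wx:
    # eigenvalues are the squared canonical correlations.
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    wx = vecs[:, order].real
    # The paired directions for Y follow as wy ∝ Cyy^{-1} Cyx wx.
    wy = np.linalg.solve(Cyy, Cxy.T) @ wx
    return wx, wy

def fuse(Xv, Xa, wx, wy, k=2):
    """Shared representation: concatenate the top-k projected features."""
    return np.hstack([Xv @ wx[:, :k], Xa @ wy[:, :k]])
```

The fused matrix returned by `fuse` would then feed a standard classifier; En-SLDCCA differs in how the projection directions are learned, not in this fusion step.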
References
An, L., Yang, S., Bhanu, B.: Person re-identification by robust canonical correlation analysis. IEEE Signal Process. Lett. 22(8), 1103–1107 (2015). doi:10.1109/LSP.2015.2390222
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U., Narayanan, S.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: International Conference on Multimodal Interfaces, pp. 205–211 (2004). doi:10.1145/1027933.1027968
Chen, L.S., Huang, T.S., Miyasato, T., Nakatsu, R.: Multimodal human emotion/expression recognition. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 366–371 (1998). doi:10.1109/AFGR.1998.670976
Chen, Y., Wiesel, A., Eldar, Y.C., Hero, A.O.: Shrinkage algorithms for mmse covariance estimation. IEEE Trans. Signal Process. 58(10), 5016–5029 (2010). doi:10.1109/TSP.2010.2053029
Datcu, D., Rothkrantz, L.J.M.: Multimodal recognition of emotions in car environments. In: DI & I Prague (2009)
Deng, J., Zhang, Z., Marchi, E., Schuller, B.: Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 511–516 (2013). doi:10.1109/ACII.2013.90
Dobrisek, S., Gajsek, R., Mihelic, F., Pavesic, N., Struc, V.: Towards efficient multi-modal emotion recognition. Int. J. Adv. Robot. Syst. 10(53), 53–53 (2013). doi:10.5772/54002
Gajsek, R., Struc, V., Mihelic, F.: Multi-modal emotion recognition using canonical correlations and acoustic features. In: International Conference on Pattern Recognition, pp. 4133–4136 (2010). doi:10.1109/ICPR.2010.1005
Gunes, H., Piccardi, M., Pantic, M.: From the lab to the real world: affect recognition using multiple cues and modalities. InTech Education and Publishing, Croatia (2008)
Han, M.J., Hsu, I.H., Song, K.T., Chang, F.Y.: A new information fusion method for svm-based robotic audio-visual emotion recognition. In: IEEE International Conference on Systems, Man and Cybernetics, pp. 2656–2661 (2007). doi:10.1109/ICSMC.2007.4413990
Hardoon, D.R., Shawe-Taylor, J.: Sparse canonical correlation analysis. Mach. Learn. 83(3), 331–353 (2011). doi:10.1007/s10994-010-5222-7
Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004). doi:10.1162/0899766042321814
Huang, L., Xin, L., Zhao, L., Tao, J.: Combining audio and video by dominance in bimodal emotion recognition. In: Second International Conference on Affective Computing and Intelligent Interaction, vol. 4738, pp. 729–730 (2007). doi:10.1007/978-3-540-74889-2_71
Kapoor, A., Burleson, W., Picard, R.W.: Automatic prediction of frustration. Int. J. Hum. Comput. Stud. 65(8), 724–736 (2007). doi:10.1016/j.ijhcs.2007.02.003
Kim, Y., Lee, H., Provost, E.M.: Deep learning for robust feature generation in audiovisual emotion recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3687–3691 (2013). doi:10.1109/ICASSP.2013.6638346
Le, D., Provost, E.M.: Emotion recognition from spontaneous speech using hidden markov models with deep belief networks. In: Automatic Speech Recognition and Understanding, pp. 216–221 (2013). doi:10.1109/ASRU.2013.6707732
Ledoit, O., Wolf, M.: Nonlinear shrinkage estimation of large-dimensional covariance matrices. Ann. Stat. 40(2), 1024–1060 (2012). doi:10.1214/12-AOS989
Li, Z., Tang, J.: Unsupervised feature selection via nonnegative spectral analysis and redundancy control. IEEE Trans. Image Process. 24(12), 5343–5355 (2015). doi:10.1109/TIP.2015.2479560
Li, Z., Tang, J.: Weakly-supervised deep matrix factorization for social image understanding. IEEE Trans. Image Process. 26(99), 276–288 (2017). doi:10.1109/TIP.2016.2624140
Li, Z., Liu, J., Yang, Y., Zhou, X., Lu, H.: Clustering-guided sparse structural learning for unsupervised feature selection. IEEE Trans. Knowl. Data Eng. 26(9), 2138–2150 (2014). doi:10.1109/TKDE.2013.65
Li, Z., Liu, J., Tang, J., Lu, H.: Robust structured subspace learning for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 37(10), 2085–2098 (2015). doi:10.1109/TPAMI.2015.2400461
Liu, M., Shan, S., Wang, R., Chen, X.: Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1749–1756 (2014). doi:10.1109/CVPR.2014.226
Lu, J., Hu, J., Zhou, X., Shang, Y.: Activity-based person identification using sparse coding and discriminative metric learning. In: ACM International Conference on Multimedia, pp. 1061–1064 (2012). doi:10.1145/2393347.2396383
Mansoorizadeh, M., Moghaddam Charkari, N.: Multimodal information fusion application to human emotion recognition from face and speech. Multimed. Tools Appl. 49(2), 277–297 (2010). doi:10.1007/s11042-009-0344-2
Martin, O., Kotsia, I., Macq, B., Pitas, I.: The eNTERFACE'05 audio-visual emotion database. In: International Conference on Data Engineering Workshops, pp. 8–8 (2006). doi:10.1109/ICDEW.2006.145
Mroueh, Y., Marcheret, E., Goel, V.: Deep multimodal learning for audio-visual speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2015)
Nie, L., Zhang, L., Yang, Y., Wang, M., Hong, R., Chua, T.S.: Beyond doctors: future health prediction from multimedia and multimodal observations. In: The 23rd ACM International Conference on Multimedia, pp. 591–600. (2015). doi:10.1145/2733373.2806217
Nie, L., Song, X., Chua, T.S.: Learning from Multiple Social Networks, pp. 118–118. Morgan & Claypool, San Rafael (2016). doi:10.2200/S00714ED1V01Y201603ICR048
Paleari, M., Huet, B.: Toward emotion indexing of multimedia excerpts. In: International Workshop on Content-Based Multimedia Indexing, pp. 425–432, (2008). doi:10.1109/CBMI.2008.4564978
Paleari, M., Benmokhtar, R., Huet, B.: Evidence theory-based multimodal emotion recognition. In: International Multimedia Modeling Conference on Advances in Multimedia Modeling, vol. 5371, pp. 435–446 (2009). doi:10.1007/978-3-540-92892-8_44
Paleari, M., Chellali, R., Huet, B.: Bimodal emotion recognition. In: International Conference on Social Robotics, vol. 6414, pp. 305–314 (2010). doi:10.1007/978-3-642-17248-9_32
Peng, Y., Zhang, D., Zhang, J.: A new canonical correlation analysis algorithm with local discrimination. Neural Process. Lett. 31(1), 1–15 (2010). doi:10.1007/s11063-009-9123-3
Pun, T., Alecu, T.I., Chanel, G., Kronegg, J.: Brain-computer interaction research at the computer vision and multimedia laboratory, university of geneva. IEEE Trans. Neural Syst. Rehabil. Eng. 14(2), 210–213 (2006). doi:10.1109/TNSRE.2006.875544
Schmidt, E.M.: Modeling and predicting emotion in music. Emotion 5, 6-6 (2012)
Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., Wendemuth, A.: Acoustic emotion recognition: a benchmark comparison of performances. In: IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 552–557 (2009). doi:10.1109/ASRU.2009.5372886
Shan, C., Gong, S., Mcowan, P.W.: Beyond facial expressions: Learning human emotion from body gestures. In: Proceedings of the British Machine Vision Conference, pp. 43.1–43.10 (2007). doi:10.5244/C.21.43
Stuhlsatz, A., Lippel, J., Zielke, T.: Feature extraction with deep neural networks by a generalized discriminant analysis. IEEE Trans. Neural Netw. Learn. Syst. 23(4), 596–608 (2012). doi:10.1109/TNNLS.2012.2183645
Tang, J., Shu, X., Qi, G.J., Li, Z., Wang, M., Yan, S., Jain, R.: Tri-clustered tensor completion for social-aware image tag refinement. IEEE Trans. Pattern Anal. Mach. Intell. (2016). doi:10.1109/TPAMI.2016.2608882
Wang, H.: Local two-dimensional canonical correlation analysis. IEEE Signal Process. Lett. 17(11), 921–924 (2010). doi:10.1109/LSP.2010.2071863
Wang, Y., Guan, L., Venetsanopoulos, A.N.: Audiovisual emotion recognition via cross-modal association in kernel space. In: IEEE International Conference on Multimedia and Expo, pp. 1–6 (2011). doi:10.1109/ICME.2011.6011949
Zeng, Z., Tu, J., Liu, M., Huang, T.S., Pianfetti, B., Roth, D., Levinson, S.: Audio-visual affect recognition. IEEE Trans. Multimed. 9(2), 424–428 (2007). doi:10.1109/TMM.2006.886310
Zhao, G., Pietikainen, M.: Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928 (2007). doi:10.1109/TPAMI.2007.1110
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61272211 and 61672267) and the General Financial Grant from the China Postdoctoral Science Foundation (No. 2015M570413).
Cite this article
Fu, J., Mao, Q., Tu, J. et al. Multimodal shared features learning for emotion recognition by enhanced sparse local discriminative canonical correlation analysis. Multimedia Systems 25, 451–461 (2019). https://doi.org/10.1007/s00530-017-0547-8