Enhanced spatio-temporal 3D CNN for facial expression classification in videos

Khanna, Deepanshu; Jindal, Neeru; Rana, Prashant Singh; Singh, Harpreet

doi:10.1007/s11042-023-16066-6

Published: 28 June 2023

Volume 83, pages 9911–9928, (2024)

Multimedia Tools and Applications

Deepanshu Khanna¹,
Neeru Jindal¹,
Prashant Singh Rana² &
Harpreet Singh²

Abstract

This article proposes a hybrid network model for a video-based human facial expression recognition (FER) system, built as an end-to-end 3D deep convolutional neural network. The proposed network combines two commonly used deep 3-dimensional Convolutional Neural Network (3D CNN) models, ResNet-50 and DenseNet-121, in an end-to-end manner with slight modifications. Currently, various methodologies exist for FER, such as 2-dimensional Convolutional Neural Networks (2D CNN), 2D CNN-Recurrent Neural Networks, 3D CNN, and feature extraction algorithms such as principal component analysis (PCA) and histogram of oriented gradients (HOG) combined with machine learning classifiers. For the proposed model, we choose 3D CNNs over the other methods because, unlike 2D CNNs, they preserve the temporal information of the videos. Moreover, they are not labor-intensive, unlike the various handcrafted feature extraction methods. The proposed system relies on the temporal averaging of information from the frame sequences of the video. The databases are pre-processed to remove unwanted backgrounds before training the 3D deep CNN from scratch. Initially, feature vectors are extracted from the video frame sequences using the 3D ResNet model. These feature vectors are fed to the blocks of the 3D DenseNet model, whose output is then used to classify the predicted emotion. The model is evaluated on three benchmark databases, RAVDESS, CK+, and BAUM1s, on which it achieves 91.69%, 98.61%, and 73.73% accuracy, respectively, outperforming various existing methods. We show that the proposed architecture works well even for classes with little training data, where many existing 3D CNNs fail.
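To make the abstract's point about temporal information concrete, the following is a minimal pure-Python sketch, not the authors' network. It assumes a toy single-channel clip indexed as [time][row][col] and a hypothetical 2×1×1 kernel that computes frame-to-frame differences; the function names `conv3d_valid` and `temporal_average` are illustrative only. It shows that a kernel with temporal extent responds to change across frames, which a kernel applied to each frame separately cannot do, and that the resulting features can then be averaged over time.

```python
from typing import List

Clip = List[List[List[float]]]  # indexed as [time][row][col]


def conv3d_valid(clip: Clip, kernel: Clip) -> Clip:
    """Naive 'valid' 3D cross-correlation over (time, height, width)."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                # The sum runs over the time offset dt as well as the
                # spatial offsets, so neighbouring frames are mixed.
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += kernel[dt][di][dj] * clip[t + dt][i + di][j + dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def temporal_average(features: Clip) -> List[List[float]]:
    """Average feature maps over the time axis, giving one 2D map per clip."""
    T = len(features)
    H, W = len(features[0]), len(features[0][0])
    return [
        [sum(features[t][i][j] for t in range(T)) / T for j in range(W)]
        for i in range(H)
    ]


# Toy clip (assumption): 3 frames of 2x2 pixels; only the top-left pixel changes.
clip = [
    [[0.0, 1.0], [1.0, 1.0]],
    [[2.0, 1.0], [1.0, 1.0]],
    [[4.0, 1.0], [1.0, 1.0]],
]
# Hypothetical kernel spanning 2 frames and 1x1 pixels: next frame minus current.
kernel = [[[-1.0]], [[1.0]]]

motion = conv3d_valid(clip, kernel)  # 2 time steps of 2x2 maps
pooled = temporal_average(motion)    # one 2x2 map for the whole clip
print(pooled)
```

In this toy case only the changing pixel produces a nonzero response, and the static pixels average to zero. A real 3D CNN learns many such kernels with larger spatial and temporal extents rather than using a fixed difference filter.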

Data availability

Not applicable.

Code availability

Not applicable.

Author information

Authors and Affiliations

Electronics and Communication Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab, India
Deepanshu Khanna & Neeru Jindal
Computer Science Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab, India
Prashant Singh Rana & Harpreet Singh

Corresponding author

Correspondence to Deepanshu Khanna.

Ethics declarations

Conflict of interest

There is no conflict of interest between authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Khanna, D., Jindal, N., Rana, P.S. et al. Enhanced spatio-temporal 3D CNN for facial expression classification in videos. Multimed Tools Appl 83, 9911–9928 (2024). https://doi.org/10.1007/s11042-023-16066-6

Received: 13 September 2021
Revised: 30 May 2023
Accepted: 18 June 2023
Published: 28 June 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s11042-023-16066-6
