Abstract
This article proposes a hybrid network model for video-based human facial expression recognition (FER) system consisting of an end-to-end 3D deep convolutional neural networks. The proposed network combines two commonly used deep 3-dimensional Convolutional Neural Networks (3D CNN) models, ResNet-50 and DenseNet-121, in an end-to-end manner with slight modifications. Currently, various methodologies exist for FER, such as 2-dimensional Convolutional Neural Networks (2D CNN), 2D CNN-Recurrent Neural Networks, 3D CNN, and features extracting algorithms such as PCA and Histogram of oriented gradients (HOG) combined with machine learning classifiers. For the proposed model, we choose 3D CNN over other methods since they preserve temporal information of the videos, unlike 2D CNN. Moreover, these aren’t labor-intensive such as various handcrafted feature extracting methods. The proposed system relies on the temporal averaging of information from frame sequences of the video. The databases are pre-processed to remove unwanted backgrounds for training 3D deep CNN from scratch. Initially, feature vectors from video frame sequences are extracted using the 3D ResNet model. These feature vectors are fed to the 3D DenseNet model’s blocks, which are then used to classify the predicted emotion. The model is evaluated on three benchmarking databases: Ravdess, CK + , and BAUM1s, which achieved 91.69%, 98.61%, and 73.73% accuracy for the respective databases and outperformed various existing methods. We prove that the proposed architecture works well even for the classes with less amount of training data where many existing 3D CNN networks fail.
Similar content being viewed by others
Data availability
Not applicable.
Code availability
Not applicable.
References
Akilan T, Wu QJ, Safaei A, Huo J, Yang Y (2020) A 3D CNN-LSTM-Based Image-to-Image Foreground Segmentation. IEEE Trans Intell Transp Syst 21(3):959–971. https://doi.org/10.1109/TITS.2019.2900426
Aly S, Abbott A L, Torki M (2016) A multimodal feature fusion framework for Kinect-based facial expression recognition using Dual Kernel Discriminant Analysis (DKDA). In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, pp. 1–10. https://doi.org/10.1109/WACV.2016.7477577
Bartlett MS, Littlewood G, Fasel I, Movellan JR (2003) Real-Time Face Detection and Facial Expression Recognition: Development and Applications to Human-Computer Interaction. In: 2003 Conference on Computer Vision and Pattern Recognition Workshop, Madison, WI, USA, pp. 53–53. https://doi.org/10.1109/CVPRW.2003.10057
Carreira J, Zisserman A (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 4724–4733. https://doi.org/10.1109/CVPR.2017.502
Chang L, Chenglin W, Yiting Q (2023) A Video Sequence Face Expression Recognition Method Based on Squeeze-and-Excitation and 3DPCA Network. Sensors 23:823. https://doi.org/10.3390/s23020823
Deniz O, Bueno G, Salido J et al (2011) Face recognition using histograms of oriented gradients. Pattern Recogn Lett 32(12):1598–1603. https://doi.org/10.1016/j.patrec.2011.01.004
Dhankhar P (2019) ResNet-50 and VGG-16 for recognizing Facial Emotions, 13(4):1-5. https://doi.org/10.21172/ijiet.134.18
Fan Y, Lu X, Li D, Liu Y (2016) Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI’ 16). Association for Computing Machinery, New York, NY, USA, pp. 445–450. https://doi.org/10.1145/2993148.2997632
Ghaleb E, Popa M, Asteriadis S (2019) Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition. In: 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, United Kingdom, pp. 552–558. https://doi.org/10.1109/ACII.2019.8925444
Haddad J, Lezoray O, Hamel P (2020) 3D-CNN for Facial Emotion Recognition in Videos. In: International Symposium on Visual Computing, pp. 298–309 Springer. https://doi.org/10.1007/978-3-030-64559-5_23
Hara K, Kataoka H, Satoh Y (2018) Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? https://doi.org/10.1109/ACCESS.2019.2901521
He Z, Jin T, Basu A, Soraghan J, Caterina G D, Petropoulakis L (2019) Human Emotion Recognition in Video Using Subtraction Pre-Processing. In: Proceedings of the 2019 11th International Conference on Machine Learning and Computing (ICMLC’ 19), Association for Computing Machinery, New York, NY, USA, pp. 374–379. https://doi.org/10.1145/3318299.3318321
He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
Ho TT, Kim T, Kim WJ et al (2021) A 3D-CNN model with CT-based parametric response mapping for classifying COPD subjects. https://doi.org/10.1038/s41598-020-79336-5
Huang G, Liu Z, Maaten LVD, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, US, pp. 4700–4708. https://doi.org/10.1109/CVPR.2017.243
Ji F, Zhang H, Zhu Z, Dai W (2021) Blog text quality assessment using a 3D CNN-based statistical framework. Futur Gener Comput Syst 116:365–370. https://doi.org/10.1016/j.future.2020.10.025
Kanade T, Cohn J F, Tian Y (2000) Comprehensive database for facial expression analysis. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG’00), Grenoble, France, pp. 46–53. https://doi.org/10.1109/AFGR.2000.840611
Khorrami P, Paine TL, Brady K, Dagli C, Huang TS (2016) How deep neural networks can improve emotion recognition on video data. In: 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, pp. 619–623. https://doi.org/10.1109/ICIP.2016.7532431
Klaeser A, Marszalek M, Schmid C (2008) A Spatio-Temporal Descriptor Based on 3D-Gradients. In: Proceedings of the British Machine Vision Conference, pp. 99.1–99.10. https://doi.org/10.5244/C.22.99
Li S, Deng W (2020) Deep Facial Expression Recognition: A Survey. In: IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2020.2981446
Li B, Lima D (2021) Facial expression recognition via ResNet-50. Int J Cogn Comput Eng. 57–64. https://doi.org/10.1016/j.ijcce.2021.02.002
Liu M, Shan S, Wang R, Chen X (2014) Learning Expressionless on Spatio-temporal Manifold for Dynamic Facial Expression Recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, pp. 1749–1756. https://doi.org/10.1109/CVPR.2014.226
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English (2018). https://doi.org/10.1371/journal.pone.0196391
Lopes AT, Aguiar E, Souza AFD, Oliveira-Santos T (2017) Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order. Pattern Recogn 61:610–628. https://doi.org/10.1016/j.patcog.2016.07.026
Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, San Francisco, CA, USA, pp. 94–101. https://doi.org/10.1109/CVPRW.2010.5543262
Miao Y, Dong H, Jaam J M A, Saddik A E (2019) A Deep Learning System for Recognizing Facial Expression in Real-Time. In: ACM Transactions on Multimedia Computing, Communications, and Applications. https://doi.org/10.1145/3311747
Mohammadi MR, Fatemizadeh E, Mahoor MH (2014) PCA-based dictionary building for accurate facial expression recognition via sparse representation. J Vis Commun Image Represent 25(5):1082–1092. https://doi.org/10.1016/j.jvcir.2014.03.006
Peña D, Tanaka F (2020) Human Perception of Social Robot’s Emotional States via Facial and Thermal Expressions. In: Association for Computing Machinery. https://doi.org/10.1145/3388469
Rivera AR, Castillo JR, Chae OO (2013) Local Directional Number Pattern for Face Analysis: Face and Expression Recognition. IEEE Trans Image Process 22(5):1740–1752. https://doi.org/10.1109/TIP.2012.2235848
Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th ACM international conference on Multimedia (MM’ 07). Association for Computing Machinery, New York, NY, USA, pp. 357–360. https://doi.org/10.1145/1291233.1291311
Sharma G, Singh L, Gautam S (2019) Automatic Facial Expression Recognition Using Combined Geometric Features. In: 3D Research 10, Article 224. https://doi.org/10.1007/s13319-019-0224-0
Singh R, Saurav S, Kumar T et al (2023) Facial expression recognition in videos using hybrid CNN & ConvLSTM. Int J Inf Tecnol (2023). https://doi.org/10.1007/s41870-023-01183-0
Tariq U et al (2011) Emotion recognition from an ensemble of features. In: 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), Santa Barbara, CA, USA, pp. 872–877. https://doi.org/10.1109/FG.2011.5771365
Villanueva MG, Zavala SR (2020) Deep Neural Network Architecture: Application for Facial Expression Recognition. IEEE Lat Am Trans 18(07):1311–1319. https://doi.org/10.1109/TLA.2020.9099774
Yang B, Cao J, Ni R, Zhang Y (2018) Facial Expression Recognition Using Weighted Mixture Deep Neural Network Based on Double-Channel Facial Images. IEEE Access 6:4630–4640. https://doi.org/10.1109/ACCESS.2017.2784096
Zhalehpour S, Onder O, Akhtar Z, Erdem CE (2017) BAUM-1: A Spontaneous Audio-Visual Face Database of Affective and Mental States. IEEE Trans Affect Comput 8(3):300–313. https://doi.org/10.1109/TAFFC.2016.2553038
Zhang S, Huang T, Gao W, Tian Q (2018) Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition. IEEE Trans Circ Syst Video Technol 28(10):3030–3043. https://doi.org/10.1109/TCSVT.2017.2719043
Zhang S, Pan X, Cui Y, Zhao X, Liu L (2019) Learning Affective Video Features for Facial Expression Recognition via Hybrid Deep Learning. IEEE Access 7:32297–32304. https://doi.org/10.1109/ACCESS.2019.2901521
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
There is no conflict of interest between authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Khanna, D., Jindal, N., Rana, P.S. et al. Enhanced spatio-temporal 3D CNN for facial expression classification in videos. Multimed Tools Appl 83, 9911–9928 (2024). https://doi.org/10.1007/s11042-023-16066-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-023-16066-6