Enhanced spatio-temporal 3D CNN for facial expression classification in videos

Published in Multimedia Tools and Applications

Abstract

This article proposes a hybrid network model for video-based human facial expression recognition (FER), built as an end-to-end 3D deep convolutional neural network. The proposed network combines two widely used deep 3-dimensional Convolutional Neural Network (3D CNN) models, ResNet-50 and DenseNet-121, in an end-to-end manner with slight modifications. Various methodologies currently exist for FER, such as 2-dimensional Convolutional Neural Networks (2D CNN), 2D CNN-Recurrent Neural Networks, 3D CNN, and feature-extraction algorithms such as PCA and Histogram of Oriented Gradients (HOG) combined with machine-learning classifiers. For the proposed model, we choose 3D CNN over the other methods because, unlike 2D CNN, it preserves the temporal information of the videos; moreover, it is not labor-intensive in the way handcrafted feature-extraction methods are. The proposed system relies on temporal averaging of information across the frame sequences of a video. The databases are pre-processed to remove unwanted backgrounds before training the 3D deep CNN from scratch. First, feature vectors are extracted from video frame sequences using the 3D ResNet model. These feature vectors are then fed to the 3D DenseNet model's blocks, which classify the predicted emotion. The model is evaluated on three benchmark databases, RAVDESS, CK+, and BAUM-1s, achieving accuracies of 91.69%, 98.61%, and 73.73% respectively, and outperforming various existing methods. We show that the proposed architecture works well even for classes with little training data, where many existing 3D CNN networks fail.
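To make the pipeline described above concrete, the following is a minimal PyTorch-style sketch of a hybrid 3D CNN in which a ResNet-style 3D stage extracts spatio-temporal features that pass through a DenseNet-style dense block before classification. This is an illustrative sketch under stated assumptions: the layer counts, channel widths, growth rate, 7-class output, and 16-frame 112×112 clips are all our own choices for illustration, not the authors' published implementation.

```python
# Hypothetical sketch of the hybrid 3D CNN idea from the abstract:
# a 3D-ResNet-style feature extractor feeding a 3D-DenseNet-style dense
# block, followed by a classifier head. All hyperparameters are assumptions.
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """Basic 3D residual block (simplified from ResNet)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # identity shortcut


class Dense3DBlock(nn.Module):
    """3D dense block: each layer's output is concatenated to its input."""
    def __init__(self, in_channels, growth_rate=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm3d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv3d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))
        self.out_channels = in_channels + num_layers * growth_rate

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # DenseNet-style concatenation
        return x


class HybridFER3D(nn.Module):
    """ResNet-style 3D stem and stage, then a dense block and classifier."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2),
                      padding=(1, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)),
        )
        self.res_stage = nn.Sequential(Residual3DBlock(64), Residual3DBlock(64))
        self.dense_stage = Dense3DBlock(64)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),   # averages over time as well as space
            nn.Flatten(),
            nn.Linear(self.dense_stage.out_channels, num_classes),
        )

    def forward(self, clip):           # clip: (N, 3, T, H, W)
        return self.head(self.dense_stage(self.res_stage(self.stem(clip))))


if __name__ == "__main__":
    model = HybridFER3D(num_classes=7)
    clip = torch.randn(2, 3, 16, 112, 112)  # 2 clips of 16 RGB frames, 112x112
    print(model(clip).shape)                # torch.Size([2, 7])
```

The global average pool in the head plays the role of the temporal averaging the abstract mentions; the actual model is far deeper (full ResNet-50 and DenseNet-121 trunks) and is trained from scratch on the pre-processed, background-removed clips.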


Data availability

Not applicable.

Code availability

Not applicable.


Author information


Corresponding author

Correspondence to Deepanshu Khanna.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Khanna, D., Jindal, N., Rana, P.S. et al. Enhanced spatio-temporal 3D CNN for facial expression classification in videos. Multimed Tools Appl 83, 9911–9928 (2024). https://doi.org/10.1007/s11042-023-16066-6

