
Deep facial spatiotemporal network for engagement prediction in online learning

Applied Intelligence (2021)

Abstract

Recently, online learning has gradually gained public acceptance. In this context, effective prediction of students’ engagement can help teachers obtain timely feedback and make adaptive adjustments to meet learners’ needs. In this paper, we present a novel model called the Deep Facial Spatiotemporal Network (DFSTN) for engagement prediction. The model contains two modules: a pretrained SE-ResNet-50 (SENet), which extracts facial spatial features, and a Long Short-Term Memory (LSTM) network with global attention (GALN), which generates an attentional hidden state. The training strategy of the model varies with the performance metric. The DFSTN captures both facial spatial and temporal information, which helps it sense fine-grained engagement states and improves engagement prediction performance. We evaluate the method on the Dataset for Affective States in E-Environments (DAiSEE) and obtain an accuracy of 58.84% in four-class classification and a Mean Square Error (MSE) of 0.0422. The results show that our method outperforms many existing works in engagement prediction on DAiSEE. Additionally, the robustness of our method is demonstrated by experiments on the EmotiW-EP dataset.
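As a concrete illustration, below is a minimal PyTorch sketch of the two-module pipeline the abstract describes: a pretrained CNN extracts per-frame facial spatial features, and an LSTM with Luong-style global attention produces an attentional hidden state for the final prediction. The backbone, hidden size, and attention scoring here are illustrative assumptions (torchvision’s ResNet-50 stands in for the face-pretrained SE-ResNet-50), not the authors’ exact configuration.

```python
# Minimal sketch of the DFSTN pipeline described in the abstract.
# Assumptions: torchvision ResNet-50 as a stand-in for SE-ResNet-50,
# hidden size 256, and a bilinear (Luong-style) global attention score.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class GALN(nn.Module):
    """LSTM with global attention over all hidden states."""

    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, hidden_dim, bias=False)  # score(h_t, h_T)
        self.combine = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, x):                       # x: (B, T, feat_dim)
        h, _ = self.lstm(x)                     # h: (B, T, hidden)
        query = h[:, -1:, :]                    # final hidden state as query
        scores = torch.bmm(self.attn(h), query.transpose(1, 2))  # (B, T, 1)
        weights = F.softmax(scores, dim=1)      # attention over time steps
        context = (weights * h).sum(dim=1)      # (B, hidden)
        # attentional hidden state: combine context with the final state
        return torch.tanh(self.combine(torch.cat([context, h[:, -1]], dim=-1)))


class DFSTN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # In the paper the backbone is a face-pretrained SE-ResNet-50;
        # a plain ResNet-50 is used here for self-containment.
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.galn = GALN(feat_dim=2048, hidden_dim=256)
        self.head = nn.Linear(256, num_classes)  # four engagement levels

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048)
        feats = feats.view(b, t, -1)            # per-frame spatial features
        return self.head(self.galn(feats))      # engagement logits


# Usage on a dummy clip: batch of 2 videos, 16 face crops each
model = DFSTN()
logits = model(torch.randn(2, 16, 3, 224, 224))  # -> shape (2, 4)
```

For the four-class DAiSEE setting, the head would be trained with cross-entropy; for the MSE metric reported above, a single regression output trained with an MSE loss would replace it.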



Acknowledgments

This study was supported by the Key Realm R&D Program of Guangzhou under grant 202007030005, the Guangdong Natural Science Foundation under grant 2019A1515011375, the National Natural Science Foundation of China under grant 62076103, the Scientific Research Foundation of the Graduate School of South China Normal University under grant 2019LKXM031, and the Special Funds for the Cultivation of Guangdong College Students’ Scientific and Technological Innovation under grant pdjh2020a0145.

Author information


Corresponding author

Correspondence to Yan Liang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Liao, J., Liang, Y. & Pan, J. Deep facial spatiotemporal network for engagement prediction in online learning. Appl Intell 51, 6609–6621 (2021). https://doi.org/10.1007/s10489-020-02139-8

