HMNet: a hierarchical multi-modal network for educational video concept prediction

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics

Abstract

Educational video concept prediction is a challenging task in online education systems that aims to assign appropriate hierarchical concepts to a video. The key to this problem is to model and fuse the video's multi-modal information. However, most prior studies ignore the incremental characteristics of educational videos, and most existing video segmentation strategies do not transfer well to them. Moreover, most existing methods overlook the class hierarchy and do not consider class dependencies when predicting the hierarchical concepts of a video. To this end, in this paper, we propose a Hierarchical Multi-modal Network (HMNet) framework that predicts the hierarchical concepts of educational videos by fusing multi-modal information and modeling class dependencies. Specifically, we first apply a video divider that extracts keyframes while accounting for the incremental characteristics of educational videos, splitting each video into a series of subtitled sections. Then, we utilize a multi-modal encoder to obtain a unified representation across modalities. Finally, we design a hierarchical predictor that fuses the multi-modal representations, models class dependencies, and predicts the hierarchical concepts of a video in a top-down manner. Extensive experiments on two real-world datasets demonstrate the effectiveness and explanatory power of HMNet.
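The abstract describes a three-stage pipeline: a video divider splits the video into subtitled sections, a multi-modal encoder maps frame and subtitle features into a shared space, and a hierarchical predictor conditions each concept level on the level above it. Below is a minimal PyTorch sketch of how such a top-down pipeline might be wired together. It is not the authors' implementation: the module names, feature dimensions (e.g., 2048-dimensional ResNet frame features, 300-dimensional GloVe subtitle embeddings), and hyperparameters are illustrative assumptions.

    import torch
    import torch.nn as nn

    class MultiModalEncoder(nn.Module):
        """Fuses per-section frame and subtitle features into one video representation."""
        def __init__(self, frame_dim, text_dim, hidden_dim):
            super().__init__()
            self.frame_proj = nn.Linear(frame_dim, hidden_dim)
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
            self.fusion = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, frame_feats, text_feats):
            # frame_feats: (B, S, frame_dim); text_feats: (B, S, text_dim),
            # where S is the number of sections produced by the video divider.
            tokens = torch.cat([self.frame_proj(frame_feats),
                                self.text_proj(text_feats)], dim=1)   # (B, 2S, H)
            return self.fusion(tokens).mean(dim=1)                    # pooled: (B, H)

    class HierarchicalPredictor(nn.Module):
        """Predicts one concept level at a time, conditioning each level on its parent."""
        def __init__(self, hidden_dim, level_sizes):
            super().__init__()
            self.heads, prev = nn.ModuleList(), 0
            for size in level_sizes:                  # ordered coarse -> fine
                self.heads.append(nn.Linear(hidden_dim + prev, size))
                prev = size

        def forward(self, video_repr):
            logits_per_level, parent = [], None
            for head in self.heads:
                inp = video_repr if parent is None else torch.cat([video_repr, parent], dim=-1)
                logits = head(inp)
                logits_per_level.append(logits)
                parent = torch.sigmoid(logits)        # parent probabilities feed the next level
            return logits_per_level

    # Toy usage: a batch of 2 videos, 8 sections each, and a 3-level concept hierarchy.
    encoder = MultiModalEncoder(frame_dim=2048, text_dim=300, hidden_dim=256)
    predictor = HierarchicalPredictor(hidden_dim=256, level_sizes=[10, 50, 200])
    frames, subtitles = torch.randn(2, 8, 2048), torch.randn(2, 8, 300)
    level_logits = predictor(encoder(frames, subtitles))
    print([tuple(l.shape) for l in level_logits])     # [(2, 10), (2, 50), (2, 200)]

Feeding the sigmoid of each level's logits into the next head is one simple way to realize the top-down class-dependency modeling the abstract describes; masking out child concepts whose parent scores are low would be an alternative under the same hierarchy assumption.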


Notes

  1. https://www.khanacademy.org.

  2. https://www.coursera.org.

  3. https://www.khanacademy.org.

  4. https://www.youtube.com.


Acknowledgements

This research was partially supported by grants from the National Key Research and Development Program of China (No. 2021YFF0901005), the National Natural Science Foundation of China (Grant Nos. 61922073, 62106244 and U20A20229), and the iFLYTEK joint research program.

Author information


Corresponding author

Correspondence to Enhong Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Huang, W., Xiao, T., Liu, Q. et al. HMNet: a hierarchical multi-modal network for educational video concept prediction. Int. J. Mach. Learn. & Cyber. 14, 2913–2924 (2023). https://doi.org/10.1007/s13042-023-01809-6
