Abstract
Educational video concept prediction is a challenging task in online education systems that aims to assign appropriate hierarchical concepts to a video. The key to this problem is to model and fuse the multi-modal information of the video. However, most prior studies ignore the incremental characteristics of educational videos, and common video segmentation strategies do not apply well to them. Moreover, most existing methods overlook the class hierarchy and ignore class dependencies when predicting the hierarchical concepts of a video. To this end, we propose a Hierarchical Multi-modal Network (HMNet) framework that predicts the hierarchical concepts of educational videos by fusing multi-modal information and modeling class dependencies. Specifically, we first apply a video divider that accounts for the incremental characteristics of educational videos to extract keyframes, dividing the video into a series of sections with subtitles. Then, we employ a multi-modal encoder to obtain a unified representation across modalities. Finally, we design a hierarchical predictor that fuses the multi-modal representations, models the class dependencies, and predicts the hierarchical concepts of a video in a top-down manner. Extensive experimental results on two real-world datasets demonstrate the effectiveness and explanatory power of HMNet.
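As a rough illustration of the pipeline described above, the following PyTorch sketch wires a multi-modal encoder to a top-down hierarchical predictor. The module structure, feature dimensions, and the concatenation-based conditioning of fine-level predictions on coarse-level ones are illustrative assumptions, not the implementation reported in the paper.

```python
# A minimal sketch of the pipeline described in the abstract.
# All module names, dimensions, and the top-down conditioning scheme below
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class MultiModalEncoder(nn.Module):
    """Encodes keyframe features and subtitle features into a unified video representation."""

    def __init__(self, frame_dim=2048, text_dim=300, hidden_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)  # e.g. CNN keyframe features
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # e.g. subtitle word embeddings
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, frames, subtitles):
        # frames: (batch, n_sections, frame_dim); subtitles: (batch, n_sections, text_dim)
        tokens = torch.cat([self.frame_proj(frames), self.text_proj(subtitles)], dim=1)
        return self.fuse(tokens).mean(dim=1)                # (batch, hidden_dim)


class HierarchicalPredictor(nn.Module):
    """Predicts coarse-level concepts first, then conditions fine-level predictions on them."""

    def __init__(self, hidden_dim=512, n_coarse=20, n_fine=200):
        super().__init__()
        self.coarse_head = nn.Linear(hidden_dim, n_coarse)
        self.fine_head = nn.Linear(hidden_dim + n_coarse, n_fine)

    def forward(self, video_repr):
        coarse_logits = self.coarse_head(video_repr)
        coarse_probs = torch.sigmoid(coarse_logits)
        # Top-down step: the fine-level head sees the coarse-level decision,
        # one simple way to encode dependencies between hierarchy levels.
        fine_logits = self.fine_head(torch.cat([video_repr, coarse_probs], dim=-1))
        return coarse_logits, fine_logits


if __name__ == "__main__":
    encoder, predictor = MultiModalEncoder(), HierarchicalPredictor()
    frames = torch.randn(4, 10, 2048)     # 4 videos, 10 sections, keyframe features
    subtitles = torch.randn(4, 10, 300)   # matching subtitle embeddings
    coarse, fine = predictor(encoder(frames, subtitles))
    print(coarse.shape, fine.shape)       # torch.Size([4, 20]) torch.Size([4, 200])
```

Feeding the coarse-level probabilities into the fine-level head is only one possible realization of the top-down dependency modeling the abstract refers to; the paper's hierarchical predictor may use a different mechanism.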






Acknowledgements
This research was partially supported by grants from the National Key Research and Development Program of China (No. 2021YFF0901005), the National Natural Science Foundation of China (Grants No. 61922073, No. 62106244, and No. U20A20229), and the iFLYTEK joint research program.
Cite this article
Huang, W., Xiao, T., Liu, Q. et al. HMNet: a hierarchical multi-modal network for educational video concept prediction. Int. J. Mach. Learn. & Cyber. 14, 2913–2924 (2023). https://doi.org/10.1007/s13042-023-01809-6