Abstract
Educational video concept prediction is a challenging task in online education systems that aims to assign appropriate hierarchical concepts to a video. The key to this problem is to model and fuse the multi-modal information of the video. However, most prior studies ignore the incremental characteristics of educational videos, and common video segmentation strategies do not apply well to them. Moreover, most existing methods overlook the class hierarchy and ignore class dependencies when predicting the hierarchical concepts of a video. To this end, we propose a Hierarchical Multi-modal Network (HMNet) framework that predicts the hierarchical concepts of educational videos by fusing multi-modal information and modeling class dependencies. Specifically, we first apply a video divider that accounts for the incremental characteristics of educational videos to extract keyframes, dividing the video into a series of sections with subtitles. Then, we employ a multi-modal encoder to obtain a unified representation across modalities. Finally, we design a hierarchical predictor that fuses the multi-modal representations, models the class dependencies, and predicts the hierarchical concepts of a video in a top-down manner. Extensive experimental results on two real-world datasets demonstrate the effectiveness and explanatory power of HMNet.
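As a rough illustration of the pipeline described above, the following PyTorch sketch wires a multi-modal encoder to a top-down hierarchical predictor. The module structure, feature dimensions, and the concatenation-based conditioning of fine-level predictions on coarse-level ones are illustrative assumptions, not the implementation reported in the paper.

```python
# A minimal sketch of the pipeline described in the abstract.
# All module names, dimensions, and the top-down conditioning scheme below
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class MultiModalEncoder(nn.Module):
    """Encodes keyframe features and subtitle features into a unified video representation."""

    def __init__(self, frame_dim=2048, text_dim=300, hidden_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)  # e.g. CNN keyframe features
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # e.g. subtitle word embeddings
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, frames, subtitles):
        # frames: (batch, n_sections, frame_dim); subtitles: (batch, n_sections, text_dim)
        tokens = torch.cat([self.frame_proj(frames), self.text_proj(subtitles)], dim=1)
        return self.fuse(tokens).mean(dim=1)                # (batch, hidden_dim)


class HierarchicalPredictor(nn.Module):
    """Predicts coarse-level concepts first, then conditions fine-level predictions on them."""

    def __init__(self, hidden_dim=512, n_coarse=20, n_fine=200):
        super().__init__()
        self.coarse_head = nn.Linear(hidden_dim, n_coarse)
        self.fine_head = nn.Linear(hidden_dim + n_coarse, n_fine)

    def forward(self, video_repr):
        coarse_logits = self.coarse_head(video_repr)
        coarse_probs = torch.sigmoid(coarse_logits)
        # Top-down step: the fine-level head sees the coarse-level decision,
        # one simple way to encode dependencies between hierarchy levels.
        fine_logits = self.fine_head(torch.cat([video_repr, coarse_probs], dim=-1))
        return coarse_logits, fine_logits


if __name__ == "__main__":
    encoder, predictor = MultiModalEncoder(), HierarchicalPredictor()
    frames = torch.randn(4, 10, 2048)     # 4 videos, 10 sections, keyframe features
    subtitles = torch.randn(4, 10, 300)   # matching subtitle embeddings
    coarse, fine = predictor(encoder(frames, subtitles))
    print(coarse.shape, fine.shape)       # torch.Size([4, 20]) torch.Size([4, 200])
```

Feeding the coarse-level probabilities into the fine-level head is only one possible realization of the top-down dependency modeling the abstract refers to; the paper's hierarchical predictor may use a different mechanism.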






Acknowledgements
This research was partially supported by grants from the National Key Research and Development Program of China (No. 2021YFF0901005), the National Natural Science Foundation of China (Grants No. 61922073, No. 62106244, and No. U20A20229), and the iFLYTEK joint research program.
Cite this article
Huang, W., Xiao, T., Liu, Q. et al. HMNet: a hierarchical multi-modal network for educational video concept prediction. Int. J. Mach. Learn. & Cyber. 14, 2913–2924 (2023). https://doi.org/10.1007/s13042-023-01809-6