Abstract
Learning to perform a complex, hands-on task in a domain such as auto maintenance, cooking, or guitar playing from a text manual alone is often frustrating and ineffective. Although complex manual tasks call for multimedia instruction, learners often rely exclusively on text. With the widespread use of user-generated content platforms such as YouTube and TikTok, learners are no longer limited to text and can watch easily accessible videos to learn such procedural tasks. However, although YouTube hosts a large and diverse corpus of instructional videos, the accuracy of videos on sensitive and complex tasks has yet to be validated against “gold standard” manuals. Our work provides a unique LLM-based multimodal pipeline that interprets and verifies task-related key steps in a video within organized knowledge schemas: demonstrated video steps are automatically extracted, systematized, and validated against a text manual of official steps. Applied to a dataset of twenty-four videos on replacing a flat tire on a car, the LLM-based pipeline identified an average of 98% of key task steps, with 86% precision and 92% recall across all videos.
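To make the reported precision and recall concrete, the following is a minimal sketch of how such scores could be computed for one video once its extracted steps have been mapped to the same labels as the manual's official steps. It is not the authors' implementation: the function name `step_metrics`, the step labels, and the assumption that matching reduces to exact label equality are all hypothetical, and the abstract does not specify how steps are matched.

```python
# Hypothetical sketch: precision and recall of extracted video steps
# against a manual's official steps. Exact label matching is an
# assumption made for illustration, not the paper's matching procedure.

def step_metrics(extracted, manual):
    """Return (precision, recall) for extracted steps versus manual steps."""
    extracted_set = set(extracted)
    manual_set = set(manual)
    matched = extracted_set & manual_set  # steps found in both sources
    # Precision: share of extracted steps that appear in the manual.
    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    # Recall: share of manual steps that were found in the video.
    recall = len(matched) / len(manual_set) if manual_set else 0.0
    return precision, recall


# Illustrative labels only; these are not taken from the paper's dataset.
manual_steps = [
    "loosen_lug_nuts",
    "raise_vehicle",
    "remove_flat_tire",
    "mount_spare",
    "tighten_lug_nuts",
]
video_steps = [
    "loosen_lug_nuts",
    "raise_vehicle",
    "remove_flat_tire",
    "mount_spare",
    "check_tire_pressure",
]

precision, recall = step_metrics(video_steps, manual_steps)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case four of the five extracted steps match the manual and four of the five manual steps are covered, so both scores are 0.80. Averaging such per-video scores across a dataset is one plausible way to arrive at aggregate figures like those quoted above.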
Acknowledgments
This work was supported by US Navy STTR #N68335-21-C-0438.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kwon, C., King, J., Carney, J., Stamper, J. (2024). A Schema-Based Approach to the Linkage of Multimodal Learning Sources with Generative AI. In: Olney, A.M., Chounta, IA., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds) Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky. AIED 2024. Communications in Computer and Information Science, vol 2151. Springer, Cham. https://doi.org/10.1007/978-3-031-64312-5_1
DOI: https://doi.org/10.1007/978-3-031-64312-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64311-8
Online ISBN: 978-3-031-64312-5
eBook Packages: Computer Science (R0)