Abstract

Learning to execute a complex, hands-on task in a domain such as auto maintenance, cooking, or guitar playing while relying exclusively on text instructions from a manual is often frustrating and ineffective. Although multimedia instruction is needed to support learning such complex manual tasks, learners often rely exclusively on text. With the widespread use of user-generated content platforms such as YouTube and TikTok, however, learners are no longer limited to standard text and can watch easily accessible videos to learn these procedural tasks. Because YouTube hosts a large and diverse corpus of instructional videos, the accuracy of videos on sensitive and complex tasks has yet to be validated against “gold standard” manuals. Our work provides a unique LLM-based multimodal pipeline that interprets and verifies task-related key steps in a video within organized knowledge schemas: demonstrated video steps are automatically extracted, systematized, and validated against a text manual of official steps. Applied to a dataset of twenty-four videos on replacing a flat tire on a car, the LLM-based pipeline achieved high performance on our metrics, identifying an average of 98% of key task steps, with 86% precision and 92% recall across all videos.
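The reported metrics compare the steps extracted from a video against the manual's official steps. A minimal sketch of that scoring, assuming exact-match step labels (the paper's pipeline performs this comparison with an LLM over knowledge schemas; all step names and function names below are illustrative, not from the paper):

```python
def score_steps(extracted, manual):
    """Precision/recall of extracted video steps vs. the manual's steps.

    `extracted` and `manual` are lists of normalized step labels. Matching
    here is exact-string; a real pipeline would use semantic/LLM matching.
    """
    extracted_set, manual_set = set(extracted), set(manual)
    matched = extracted_set & manual_set  # steps found in both sources
    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(manual_set) if manual_set else 0.0
    return precision, recall

# Hypothetical example: one manual step missed, one extra step extracted.
manual = ["loosen lug nuts", "jack up car", "remove flat tire",
          "mount spare", "tighten lug nuts", "lower car"]
extracted = ["loosen lug nuts", "jack up car", "mount spare",
             "tighten lug nuts", "lower car", "check tire pressure"]

p, r = score_steps(extracted, manual)  # 5 of 6 match in each direction
```

Here precision and recall are both 5/6, since one manual step ("remove flat tire") is missed and one extracted step ("check tire pressure") has no manual counterpart.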

Acknowledgments

This work was supported by US Navy STTR #N68335-21-C-0438.

Corresponding author

Correspondence to Christine Kwon.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kwon, C., King, J., Carney, J., Stamper, J. (2024). A Schema-Based Approach to the Linkage of Multimodal Learning Sources with Generative AI. In: Olney, A.M., Chounta, IA., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds) Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky. AIED 2024. Communications in Computer and Information Science, vol 2151. Springer, Cham. https://doi.org/10.1007/978-3-031-64312-5_1

  • DOI: https://doi.org/10.1007/978-3-031-64312-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-64311-8

  • Online ISBN: 978-3-031-64312-5

  • eBook Packages: Computer Science (R0)
