Abstract

Learning to execute a complex, hands-on task in a domain such as auto maintenance, cooking, or guitar playing while relying exclusively on text instructions from a manual is often frustrating and ineffective. Although multimedia instruction is needed to support learning such complex manual tasks, learners often rely exclusively on text. With the widespread use of user-generated content platforms such as YouTube and TikTok, however, learners are no longer limited to standard text and can watch easily accessible videos to learn these procedural tasks. Because YouTube hosts a large and diverse corpus of instructional videos, the accuracy of videos on sensitive and complex tasks has yet to be validated against “gold standard” manuals. Our work provides a unique LLM-based multimodal pipeline that interprets and verifies task-related key steps in a video within organized knowledge schemas: demonstrated video steps are automatically extracted, systematized, and validated against a text manual of official steps. Applied to a dataset of twenty-four videos on replacing a flat tire on a car, the LLM-based pipeline achieved high performance on our metrics, identifying an average of 98% of key task steps, with 86% precision and 92% recall across all videos.
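The reported metrics compare the steps extracted from a video against the manual's official steps. A minimal sketch of that scoring, assuming exact-match step labels (the paper's pipeline performs this comparison with an LLM over knowledge schemas; all step names and function names below are illustrative, not from the paper):

```python
def score_steps(extracted, manual):
    """Precision/recall of extracted video steps vs. the manual's steps.

    `extracted` and `manual` are lists of normalized step labels. Matching
    here is exact-string; a real pipeline would use semantic/LLM matching.
    """
    extracted_set, manual_set = set(extracted), set(manual)
    matched = extracted_set & manual_set  # steps found in both sources
    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(manual_set) if manual_set else 0.0
    return precision, recall

# Hypothetical example: one manual step missed, one extra step extracted.
manual = ["loosen lug nuts", "jack up car", "remove flat tire",
          "mount spare", "tighten lug nuts", "lower car"]
extracted = ["loosen lug nuts", "jack up car", "mount spare",
             "tighten lug nuts", "lower car", "check tire pressure"]

p, r = score_steps(extracted, manual)  # 5 of 6 match in each direction
```

Here precision and recall are both 5/6, since one manual step ("remove flat tire") is missed and one extracted step ("check tire pressure") has no manual counterpart.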

Acknowledgments

This work was supported by US Navy STTR #N68335-21-C-0438.

Corresponding author

Correspondence to Christine Kwon.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kwon, C., King, J., Carney, J., Stamper, J. (2024). A Schema-Based Approach to the Linkage of Multimodal Learning Sources with Generative AI. In: Olney, A.M., Chounta, IA., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds) Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky. AIED 2024. Communications in Computer and Information Science, vol 2151. Springer, Cham. https://doi.org/10.1007/978-3-031-64312-5_1

  • DOI: https://doi.org/10.1007/978-3-031-64312-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-64311-8

  • Online ISBN: 978-3-031-64312-5

  • eBook Packages: Computer Science (R0)
