Abstract
Expert human tutors can observe learners' mistakes to understand their misconceptions and procedural errors. Highly capable but opaque large language models have shown remarkable abilities across numerous domains and may be useful for adaptive instruction in a variety of ways. Working with publicly available data from the National Assessment of Educational Progress (388 questions selected from 4th-, 8th-, and 12th-grade math and science), we examined three questions:
1) Do language models find the same problems difficult as students do? We found statistically significant but small similarities in performance, which differed somewhat by model.
2) Do language models have the same pattern of errors as students? Under minimal prompts, the models often mirrored students in choosing the same incorrect answers; this alignment decreased when the models were prompted with chain-of-thought reasoning (the two prompting conditions are sketched after the abstract).
3) Can language models interpret and explain students' wrong answers? We presented frequently chosen wrong answers to NAEP items to GPT-4 and to an experienced science teacher and compared their explanations. The explanations corresponded well, with 81% in full or partial agreement.
Discussion focuses on how these capabilities can be used for test design and adaptive instruction.
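The two prompting conditions referenced above can be illustrated with a short sketch. This is a minimal illustration only: the item text, the exact prompt wording, and the model identifier are assumptions rather than the study's actual materials; the call uses OpenAI's standard chat-completions interface.

```python
# Minimal sketch contrasting "minimal" and chain-of-thought prompting of a
# multiple-choice item. The item, prompt wording, and model name below are
# illustrative assumptions, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical NAEP-style multiple-choice item (not an actual NAEP question).
ITEM = (
    "Which unit is most appropriate for measuring the mass of a paper clip?\n"
    "A. kilograms\nB. grams\nC. liters\nD. meters"
)

MINIMAL_PROMPT = (
    "Answer the following multiple-choice question with a single letter.\n\n"
    f"{ITEM}\n\nAnswer:"
)

COT_PROMPT = (
    "Answer the following multiple-choice question. Think step by step, then "
    "give your final answer as a single letter on the last line.\n\n"
    f"{ITEM}"
)


def ask(prompt: str, model: str = "gpt-4") -> str:
    """Send one prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep answers as deterministic as the API allows
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print("Minimal:", ask(MINIMAL_PROMPT))
    print("Chain of thought:", ask(COT_PROMPT))
```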
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Appendix A. Detailed Language Model Performance by Subject and Grade
Accuracy of Models Answering NAEP Items, Minimal Prompting
| Grade | Subject | Students | Llama-7B | Llama-13B | Llama-70B | GPT-3.5 | GPT-4 |
|---|---|---|---|---|---|---|---|
| 4 | Math | 55.2% | 40.6% | 54.3% | 52.7% | 62.0% | 75.9% |
| 4 | Science | 60.7% | 72.0% | 80.0% | 86.6% | 92.0% | 100.0% |
| 8 | Math | 53.5% | 36.6% | 42.1% | 45.9% | 55.7% | 70.9% |
| 8 | Science | 54.4% | 60.0% | 65.3% | 78.6% | 85.3% | 100.0% |
| 12 | Math | 53.2% | 45.9% | 45.3% | 48.0% | 57.3% | 57.3% |
| 12 | Science | 45.1% | 70.2% | 71.6% | 80.7% | 82.4% | 92.1% |
Accuracy of Models Answering NAEP Items, Chain of Thought Prompting
| Grade | Subject | Students | Llama-7B | Llama-13B | Llama-70B | GPT-3.5 | GPT-4 | Gemini Pro |
|---|---|---|---|---|---|---|---|---|
| 4 | Math | 55.2% | 45.1% | 48.4% | 68.1% | 85.9% | 96.6% | 81.0% |
| 4 | Science | 60.7% | 65.3% | 73.3% | 85.3% | 96.0% | 93.3% | 95.8% |
| 8 | Math | 53.5% | 36.4% | 38.8% | 59.2% | 82.7% | 96.4% | 76.8% |
| 8 | Science | 54.4% | 52.0% | 65.3% | 76.0% | 84.0% | 92.0% | 97.3% |
| 12 | Math | 53.2% | 33.7% | 49.7% | 56.3% | 82.0% | 93.4% | 76.4% |
| 12 | Science | 45.1% | 54.0% | 74.6% | 78.8% | 86.8% | 94.7% | 82.9% |
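The item-level comparison behind question 1 can be illustrated as follows. This is a sketch under stated assumptions: the per-item data are invented, and the point-biserial (Pearson) correlation shown is one plausible statistic rather than necessarily the one reported in the paper.

```python
# Illustrative sketch of quantifying item-level difficulty alignment:
# correlate each item's student percent-correct with whether a model
# answered that item correctly. Data values are hypothetical, and the
# choice of a point-biserial correlation is an assumption for illustration.
from scipy import stats

# Hypothetical per-item data (one entry per NAEP item).
student_pct_correct = [0.82, 0.64, 0.55, 0.47, 0.39, 0.31]  # share of students correct
model_correct = [1, 1, 1, 0, 1, 0]                          # model right (1) or wrong (0)

r, p = stats.pointbiserialr(model_correct, student_pct_correct)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```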
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Smart, F., Bos, N.D., Bos, J.T. (2024). Can Large Language Models Recognize and Respond to Student Misconceptions?. In: Sottilare, R.A., Schwarz, J. (eds) Adaptive Instructional Systems. HCII 2024. Lecture Notes in Computer Science, vol 14727. Springer, Cham. https://doi.org/10.1007/978-3-031-60609-0_21
DOI: https://doi.org/10.1007/978-3-031-60609-0_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60608-3
Online ISBN: 978-3-031-60609-0