Abstract
Expert human tutors can observe learners' mistakes to understand their misconceptions and procedural errors. Highly capable but opaque large language models have shown remarkable abilities across numerous domains and may be useful for adaptive instruction in a variety of ways. Working with publicly available data from the National Assessment of Educational Progress (388 questions selected from 4th-, 8th-, and 12th-grade math and science), we examined three questions:
1) Do language models find the same problems difficult as students do? We found statistically significant but small similarities in performance, which differed somewhat by model.
2) Do language models have the same pattern of errors as students? Under minimal prompts, the models often mirrored students in choosing the same incorrect answers; this alignment decreased when the models were prompted with chain-of-thought reasoning (the two prompting conditions are sketched after the abstract).
3) Can language models interpret and explain students' wrong answers? We presented frequently chosen wrong answers to NAEP items to GPT-4 and to an experienced science teacher and compared their explanations. The explanations corresponded well, with 81% in full or partial agreement.
Discussion focuses on how these capabilities can be used for test design and adaptive instruction.
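The two prompting conditions referenced above can be illustrated with a short sketch. This is a minimal illustration only: the item text, the exact prompt wording, and the model identifier are assumptions rather than the study's actual materials; the call uses OpenAI's standard chat-completions interface.

```python
# Minimal sketch contrasting "minimal" and chain-of-thought prompting of a
# multiple-choice item. The item, prompt wording, and model name below are
# illustrative assumptions, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical NAEP-style multiple-choice item (not an actual NAEP question).
ITEM = (
    "Which unit is most appropriate for measuring the mass of a paper clip?\n"
    "A. kilograms\nB. grams\nC. liters\nD. meters"
)

MINIMAL_PROMPT = (
    "Answer the following multiple-choice question with a single letter.\n\n"
    f"{ITEM}\n\nAnswer:"
)

COT_PROMPT = (
    "Answer the following multiple-choice question. Think step by step, then "
    "give your final answer as a single letter on the last line.\n\n"
    f"{ITEM}"
)


def ask(prompt: str, model: str = "gpt-4") -> str:
    """Send one prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep answers as deterministic as the API allows
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print("Minimal:", ask(MINIMAL_PROMPT))
    print("Chain of thought:", ask(COT_PROMPT))
```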
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Appendix A. Detailed Language Model Performance by Subject and Grade
Accuracy of Models Answering NAEP Items, Minimal Prompting
| Grade | Subject | Students | Llama-7B | Llama-13B | Llama-70B | GPT-3.5 | GPT-4 |
|---|---|---|---|---|---|---|---|
| 4 | Math | 55.2% | 40.6% | 54.3% | 52.7% | 62.0% | 75.9% |
| 4 | Science | 60.7% | 72.0% | 80.0% | 86.6% | 92.0% | 100.0% |
| 8 | Math | 53.5% | 36.6% | 42.1% | 45.9% | 55.7% | 70.9% |
| 8 | Science | 54.4% | 60.0% | 65.3% | 78.6% | 85.3% | 100.0% |
| 12 | Math | 53.2% | 45.9% | 45.3% | 48.0% | 57.3% | 57.3% |
| 12 | Science | 45.1% | 70.2% | 71.6% | 80.7% | 82.4% | 92.1% |
Accuracy of Models Answering NAEP Items, Chain of Thought Prompting
| Grade | Subject | Students | Llama-7B | Llama-13B | Llama-70B | GPT-3.5 | GPT-4 | Gemini Pro |
|---|---|---|---|---|---|---|---|---|
| 4 | Math | 55.2% | 45.1% | 48.4% | 68.1% | 85.9% | 96.6% | 81.0% |
| 4 | Science | 60.7% | 65.3% | 73.3% | 85.3% | 96.0% | 93.3% | 95.8% |
| 8 | Math | 53.5% | 36.4% | 38.8% | 59.2% | 82.7% | 96.4% | 76.8% |
| 8 | Science | 54.4% | 52.0% | 65.3% | 76.0% | 84.0% | 92.0% | 97.3% |
| 12 | Math | 53.2% | 33.7% | 49.7% | 56.3% | 82.0% | 93.4% | 76.4% |
| 12 | Science | 45.1% | 54.0% | 74.6% | 78.8% | 86.8% | 94.7% | 82.9% |
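The item-level comparison behind question 1 can be illustrated as follows. This is a sketch under stated assumptions: the per-item data are invented, and the point-biserial (Pearson) correlation shown is one plausible statistic rather than necessarily the one reported in the paper.

```python
# Illustrative sketch of quantifying item-level difficulty alignment:
# correlate each item's student percent-correct with whether a model
# answered that item correctly. Data values are hypothetical, and the
# choice of a point-biserial correlation is an assumption for illustration.
from scipy import stats

# Hypothetical per-item data (one entry per NAEP item).
student_pct_correct = [0.82, 0.64, 0.55, 0.47, 0.39, 0.31]  # share of students correct
model_correct = [1, 1, 1, 0, 1, 0]                          # model right (1) or wrong (0)

r, p = stats.pointbiserialr(model_correct, student_pct_correct)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```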
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Smart, F., Bos, N.D., Bos, J.T. (2024). Can Large Language Models Recognize and Respond to Student Misconceptions?. In: Sottilare, R.A., Schwarz, J. (eds) Adaptive Instructional Systems. HCII 2024. Lecture Notes in Computer Science, vol 14727. Springer, Cham. https://doi.org/10.1007/978-3-031-60609-0_21
DOI: https://doi.org/10.1007/978-3-031-60609-0_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-60608-3
Online ISBN: 978-3-031-60609-0