
Can Large Language Models Recognize and Respond to Student Misconceptions?

  • Conference paper

Adaptive Instructional Systems (HCII 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14727)

Abstract

Expert human tutors can observe learner mistakes to understand their misconceptions and procedural errors. Highly capable but opaque, large language models have shown remarkable abilities across numerous domains and may be useful for adaptive instruction in a variety of ways. Working with publicly available data from the National Assessment of Educational Progress (NAEP; 388 questions selected from 4th-, 8th-, and 12th-grade math and science), we examined three questions:

  1. Do language models find the same problems difficult as students do? We found statistically significant but small similarities in performance, which differ somewhat by model.

  2. Do language models show the same pattern of errors as students? Under the “minimal” prompts, the models often mirror students in choosing the same incorrect answers. However, this alignment decreases when models are prompted with “chain of thought”.

  3. Can language models interpret and explain students’ wrong answers? We presented frequently chosen wrong answers to NAEP items to GPT-4 and to an experienced science teacher, and compared their explanations. The two corresponded well, with 81% of explanations fully or partially in agreement.

Discussion focuses on how these capabilities can be used for test design and adaptive instruction.
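The contrast between the two prompting conditions can be sketched as prompt templates. These are hypothetical templates for illustration only; the study's exact prompt wording is not reproduced here, and the function name and structure are assumptions.

```python
def build_prompt(question: str, choices: list[str], chain_of_thought: bool = False) -> str:
    """Assemble a multiple-choice prompt in one of two styles: 'minimal'
    (answer letter only) or 'chain of thought' (reason step by step first).
    Hypothetical templates, not the paper's actual prompts."""
    # Label answer options A, B, C, ...
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    if chain_of_thought:
        instruction = ("Think through the problem step by step, "
                       "then give the letter of the best answer.")
    else:
        instruction = "Answer with the letter of the best answer only."
    return f"{question}\n{lettered}\n{instruction}"

print(build_prompt("What is 3/4 of 12?", ["6", "8", "9", "16"], chain_of_thought=True))
```

The same item can then be posed to a model under both conditions, and the chosen answer letters compared against the distribution of student responses.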



Author information

Correspondence to Nathan D. Bos.


Ethics declarations

The authors have no competing interests to declare that are relevant to the content of this article.

Appendix A. Detailed Language Model Performance by Subject and Grade


Accuracy of Models Answering NAEP Items, Minimal Prompting

| Grade | Subject | Students | llama7b | llama13b | llama70b | gpt35 | gpt4   |
|-------|---------|----------|---------|----------|----------|-------|--------|
| 4     | Math    | 55.2%    | 40.6%   | 54.3%    | 52.7%    | 62.0% | 75.9%  |
| 4     | Science | 60.7%    | 72.0%   | 80.0%    | 86.6%    | 92.0% | 100.0% |
| 8     | Math    | 53.5%    | 36.6%   | 42.1%    | 45.9%    | 55.7% | 70.9%  |
| 8     | Science | 54.4%    | 60.0%   | 65.3%    | 78.6%    | 85.3% | 100.0% |
| 12    | Math    | 53.2%    | 45.9%   | 45.3%    | 48.0%    | 57.3% | 57.3%  |
| 12    | Science | 45.1%    | 70.2%   | 71.6%    | 80.7%    | 82.4% | 92.1%  |

Accuracy of Models Answering NAEP Items, Chain of Thought Prompting

| Grade | Subject | Students | llama7b | llama13b | llama70b | gpt35 | gpt4  | Gemini Pro |
|-------|---------|----------|---------|----------|----------|-------|-------|------------|
| 4     | Math    | 55.2%    | 45.1%   | 48.4%    | 68.1%    | 85.9% | 96.6% | 81.0%      |
| 4     | Science | 60.7%    | 65.3%   | 73.3%    | 85.3%    | 96.0% | 93.3% | 95.8%      |
| 8     | Math    | 53.5%    | 36.4%   | 38.8%    | 59.2%    | 82.7% | 96.4% | 76.8%      |
| 8     | Science | 54.4%    | 52.0%   | 65.3%    | 76.0%    | 84.0% | 92.0% | 97.3%      |
| 12    | Math    | 53.2%    | 33.7%   | 49.7%    | 56.3%    | 82.0% | 93.4% | 76.4%      |
| 12    | Science | 45.1%    | 54.0%   | 74.6%    | 78.8%    | 86.8% | 94.7% | 82.9%      |


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Smart, F., Bos, N.D., Bos, J.T. (2024). Can Large Language Models Recognize and Respond to Student Misconceptions? In: Sottilare, R.A., Schwarz, J. (eds) Adaptive Instructional Systems. HCII 2024. Lecture Notes in Computer Science, vol 14727. Springer, Cham. https://doi.org/10.1007/978-3-031-60609-0_21


  • DOI: https://doi.org/10.1007/978-3-031-60609-0_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-60608-3

  • Online ISBN: 978-3-031-60609-0

  • eBook Packages: Computer Science, Computer Science (R0)
