Learning When to Defer to Humans for Short Answer Grading

Li, Zhaohui; Zhang, Chengning; Jin, Yumi; Cang, Xuesong; Puntambekar, Sadhana; Passonneau, Rebecca J.

doi:10.1007/978-3-031-36272-9_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13916)

Included in the following conference series:

International Conference on Artificial Intelligence in Education

Abstract

To assess student knowledge, educators face a tradeoff between open-ended and fixed-response questions. Open-ended questions are easier to formulate and provide greater insight into student learning, but they are burdensome to assess. Machine learning methods that could reduce the assessment burden also have a cost, given that large datasets of reliably assessed examples (labeled data) are required for training and testing. We address the human costs of assessment and data labeling using selective prediction, where the output of a machine-learned model is used when the model makes a confident decision, but otherwise the model defers to a human decision-maker. The goal is to defer less often while maintaining human assessment quality on the total output. We refer to the deferral criteria as a deferral policy, and we show it is possible to learn when to defer. We first trained an autograder on a combination of historical data and a small amount of newly labeled data, achieving moderate performance. We then used the autograder output as input to a logistic regression to learn when to defer. The learned logistic regression equation constitutes a deferral policy. Tests of the selective prediction method on a held-out test set showed that human-level assessment quality can be achieved with a major reduction of human effort.
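
The deferral idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature choice (top class probability and its margin over the runner-up), the function names, the toy data, and the 0.5 threshold are all assumptions made for illustration. The sketch fits a logistic regression that predicts whether an autograder's label is likely to be correct, then defers to a human when that predicted chance falls below the threshold.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def features(probs):
    """Hypothetical confidence features from autograder class probabilities:
    the top probability and its margin over the runner-up."""
    top, second = sorted(probs, reverse=True)[:2]
    return [top, top - second]

def fit_deferral_policy(X, y, lr=0.5, epochs=5000):
    """Fit logistic regression by batch gradient descent.
    y[i] is 1 if the autograder agreed with the human label, else 0."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def should_defer(probs, w, b, threshold=0.5):
    """Defer to a human when the predicted chance that the
    autograder is right falls below the (assumed) threshold."""
    z = sum(wj * xj for wj, xj in zip(w, features(probs))) + b
    return sigmoid(z) < threshold

# Toy, made-up data: (autograder class probabilities, 1 if it matched the human).
graded = [
    ([0.90, 0.05, 0.05], 1), ([0.80, 0.10, 0.10], 1),
    ([0.85, 0.10, 0.05], 1), ([0.70, 0.20, 0.10], 1),
    ([0.40, 0.35, 0.25], 0), ([0.50, 0.30, 0.20], 0),
    ([0.45, 0.40, 0.15], 0), ([0.55, 0.35, 0.10], 0),
]
X = [features(p) for p, _ in graded]
y = [ok for _, ok in graded]
w, b = fit_deferral_policy(X, y)  # the learned equation acts as the deferral policy
```

In this sketch, raising `threshold` sends more answers to the human grader and lowering it sends fewer, which is the effort-versus-quality tradeoff the abstract describes.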

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Pennsylvania State University, State College, USA
Zhaohui Li, Chengning Zhang, Yumi Jin & Rebecca J. Passonneau
Department of Educational Psychology, University of Wisconsin-Madison, Madison, USA
Xuesong Cang & Sadhana Puntambekar

Corresponding author

Correspondence to Zhaohui Li.

Editor information

Editors and Affiliations

University of Southern California, Los Angeles, CA, USA
Ning Wang
University of British Columbia, Vancouver, BC, Canada
Genaro Rebolledo-Mendez
North Carolina State University, Raleigh, NC, USA
Noboru Matsuda
Despacho 3.01, UNED-Grupo de Investigación aDeNu, Madrid, Spain
Olga C. Santos
University of Leeds, Leeds, UK
Vania Dimitrova

About this paper

Cite this paper

Li, Z., Zhang, C., Jin, Y., Cang, X., Puntambekar, S., Passonneau, R.J. (2023). Learning When to Defer to Humans for Short Answer Grading. In: Wang, N., Rebolledo-Mendez, G., Matsuda, N., Santos, O.C., Dimitrova, V. (eds) Artificial Intelligence in Education. AIED 2023. Lecture Notes in Computer Science, vol 13916. Springer, Cham. https://doi.org/10.1007/978-3-031-36272-9_34

DOI: https://doi.org/10.1007/978-3-031-36272-9_34
Published: 26 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36271-2
Online ISBN: 978-3-031-36272-9
eBook Packages: Computer Science, Computer Science (R0)
