
Modeling essay grading with pre-trained BERT features

Published in: Applied Intelligence

Abstract

Essay writing is an important skill: it requires one to express ideas and understanding of a topic clearly through language articulation and examples. Just as writing essays is a skill, so is grading them. Grading demands considerable effort, and the task becomes tedious and repetitive when the student-to-teacher ratio is high. As with any other repetitive task, technological intervention in the form of automated essay grading has long been considered. The main challenge in automated essay grading, however, lies in understanding language construction, word usage, and the presentation of an idea, argument, or narration; this linguistic complexity makes natural language understanding difficult. In this work, we report experiments with pre-trained static word embeddings, such as GloVe and fastText, and the pre-trained contextual model Bidirectional Encoder Representations from Transformers (BERT) for automated essay grading. For the regression task, we use Long Short-Term Memory (LSTM) and Support Vector Regression (SVR) models under various feature settings derived from the learnt embeddings. Results are reported on all 8 prompts of the ASAP-AES dataset. Our approach achieves an average Quadratic Weighted Kappa (QWK) of 0.81 with SVR and 0.71 with LSTM on in-domain test essays. The SVR model exceeds the human-human agreement of 0.75. To the best of our knowledge, our SVR model with pre-trained BERT embeddings achieves the highest average QWK reported on the ASAP-AES dataset. We further evaluate our approach on adversarial samples generated from permuted essays and off-topic essays, and show experimentally that although our LSTM model does not achieve a high QWK against human-assigned grades, it is robust under the adversarial settings considered.
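The agreement metric reported above, Quadratic Weighted Kappa, can be computed directly from two sets of integer grades. A minimal NumPy sketch (this is not the authors' code; the function name and grade range are illustrative):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two integer rating sequences on [min_rating, max_rating]."""
    n = max_rating - min_rating + 1
    # Observed confusion matrix between the two raters
    O = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating, b - min_rating] += 1
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    # Expected matrix under chance agreement: outer product of the marginals
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()
```

QWK weights disagreements by the squared distance between grades, so a prediction off by two scale points is penalised four times as much as one off by a single point; perfect agreement gives 1.0 and chance-level agreement gives 0.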


Notes

  1. https://www.kaggle.com/c/asap-aes

  2. https://nlp.stanford.edu/projects/glove/

  3. https://fasttext.cc/docs/en/english-vectors.html

  4. https://www.nltk.org/

  5. https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1
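Under the setting described in the abstract (pooled pre-trained embeddings fed to a regressor), the overall shape of such a pipeline can be sketched with synthetic stand-in embeddings. This is a sketch under assumptions, not the authors' implementation: `embed_essay` stands in for GloVe/fastText lookups or BERT outputs, and ridge regression stands in for the SVR regressor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained token embeddings; in the paper these would come
# from GloVe/fastText lookups or BERT's 768-d contextual outputs.
def embed_essay(num_tokens, dim=32):
    return rng.normal(size=(num_tokens, dim))

# One fixed-length feature vector per essay via mean pooling over tokens
essays = [embed_essay(rng.integers(50, 200)) for _ in range(100)]
X = np.stack([e.mean(axis=0) for e in essays])
y = rng.integers(0, 11, size=100).astype(float)  # synthetic grades 0..10

# Ridge regression (closed form) as a stand-in for SVR
lam = 1.0
Xb = np.hstack([X, np.ones((len(X), 1))])        # append a bias column
w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

# Round and clip predictions back onto the grade scale
preds = np.clip(np.rint(Xb @ w), 0, 10)
```

The shape of the computation (pool token embeddings into a fixed vector, regress, then round to the prompt's grade scale) is the same regardless of which embedding and regressor are plugged in.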


Funding

Author Annapurna Sharma is supported by the Visvesvaraya PhD Scheme, Ministry of Electronics and Information Technology (MeitY), Government of India, under grant number MEITY-PHD-2541.

Author information


Corresponding author

Correspondence to Annapurna Sharma.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sharma, A., Jayagopi, D.B. Modeling essay grading with pre-trained BERT features. Appl Intell 54, 4979–4993 (2024). https://doi.org/10.1007/s10489-024-05410-4

