Abstract
Essay writing is an important skill: it requires one to express ideas and understanding of a topic clearly through language articulation and examples. Grading essays is a skill in its own right. It demands considerable effort, and the task becomes tedious and repetitive when the student-to-teacher ratio is high. As with other repetitive tasks, automating essay grading has long been considered. The main challenge, however, lies in understanding language construction, word usage, and the presentation of an idea, argument, or narration; language complexity makes natural language understanding difficult. In this work, we report experiments with pre-trained static word embeddings, GloVe and fastText, and the pre-trained contextual model Bidirectional Encoder Representations from Transformers (BERT) for automated essay grading. For the regression task, we use Long Short-Term Memory (LSTM) and Support Vector Regression (SVR) models under various feature settings derived from the learnt embeddings. Results are reported on all 8 prompts of the ASAP-AES dataset. Our SVR and LSTM models achieve an average Quadratic Weighted Kappa (QWK) of 0.81 and 0.71, respectively, on in-domain test essays. The SVR model exceeds the human-human agreement of 0.75. To the best of our knowledge, our SVR model with pre-trained BERT embeddings achieves the highest average QWK reported on the ASAP-AES dataset. We further evaluate our approach on adversarial samples generated from permuted essays and off-topic essays, and show experimentally that the LSTM model, although its QWK against human-assigned grades is lower, is robust under the adversarial settings considered.
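The evaluation metric reported above, Quadratic Weighted Kappa, measures agreement between two raters on an ordinal scale, penalizing disagreements by the squared distance between the assigned grades. The following is a minimal plain-Python sketch of the standard QWK formula (an illustration only, not the authors' implementation; function and variable names are our own):

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two lists of integer grades on [min_rating, max_rating]."""
    n = max_rating - min_rating + 1
    num_items = len(rater_a)

    # Observed agreement matrix O[i][j]: count of items graded i by A, j by B.
    O = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating][b - min_rating] += 1

    # Marginal grade histograms for each rater.
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]

    numer = denom = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2          # quadratic weight
            E = hist_a[i] * hist_b[j] / num_items     # expected count under independence
            numer += w * O[i][j]
            denom += w * E
    return 1.0 - numer / denom
```

Perfect agreement yields 1.0, chance-level agreement yields 0.0, and systematic disagreement yields negative values, which is why a QWK of 0.81 against human grades, above the 0.75 human-human agreement, is a strong result.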



Funding
Author Annapurna Sharma is supported by the Visvesvaraya PhD Scheme, Ministry of Electronics and Information Technology (MeitY), Government of India, under grant number MEITY-PHD-2541.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Sharma, A., Jayagopi, D.B. Modeling essay grading with pre-trained BERT features. Appl Intell 54, 4979–4993 (2024). https://doi.org/10.1007/s10489-024-05410-4