Introduction

Recent advances in deep learning and natural language processing (NLP) have challenged automated testing and grading methods to improve their performance and to harness valuable hand-graded essay datasets – such as the free Automated Student Assessment Prize (ASAP) datasets – to accurately measure performance. Presently, reports about the performance of automated essay scoring (AES) systems commonly – and perhaps inadvertently – lack transparency. Such ambiguity in research outcomes of AES techniques hinders performance evaluations and comparative analyses of techniques. This article argues that AES research requires proper protocols to describe methodologies and to report outcomes. Additionally, the article reviews state-of-the-art AES systems assessed using ASAP’s seventh dataset to: a) underscore features that facilitate reasonable evaluation of AES performances; b) describe cutting-edge natural language processing tools, explaining the extent to which writing metrics can now capture and indicate performance; c) predict rubric scores using six different feature-based multi-layer perceptron deep neural network architectures and compare their performance; and d) assess the importance of the features present in each of the rubric scoring models.

The following section provides background information on the datasets used in this study, which are also extensively exploited by the research community to train and evaluate AES systems. The third section synthesizes relevant literature about recent developments in AES, compares contemporary AES systems, and evaluates their features. The fourth section examines methodologies that support finer-grained rubric score prediction. The fifth and sixth sections explore the distribution of holistic and rubric scores, delineate the performance of naïve and “smart” deep/shallow neural network predictors, and discuss implications. The seventh section initiates a discussion on the linguistic aspects considered by the rubric scoring models and how the models differ from one another. Finally, the last section summarizes conclusions, highlights limitations, and discusses the next stages of AES research.

Background: The Automated Student Assessment Prize

In 2012, the William and Flora Hewlett Foundation funded the Automated Student Assessment Prize (ASAP) contest to evaluate both the progress of automated essay scoring and its readiness to be implemented across the United States in state-wide writing assessments (Shermis 2014). Kaggle collected eight essay datasets from state-wide assessments written by Grade 7 to Grade 10 students in six different US states. Kaggle then subcontracted commercial vendors to grade the essays adhering to a thorough scoring process.

Each essay dataset originated from a single assessment for a specific grade (7–10) in a specific state. The ASAP contest asked participants to develop AES systems to automatically grade the essays in the database and report on the level of agreement between the machine grader and human graders, measured by the quadratic weighted kappa. This article argues that the performance comparison process was neither effective nor balanced since, as Table 1 demonstrates, each dataset had a unique underlying writing construct. AES performance should therefore be analyzed per writing task rather than globally.

Table 1 Characteristics of ASAP’s original essay datasets (Shermis 2014)

Both commercial vendors and data scientists from academia participated in the contest. Officials determined the winners based on the average quadratic weighted kappa value on all eight essay datasets. While this measure was useful for contest purposes, it does not offer a transparent account of research processes and results. For instance, it has been shown that more interpretable and trustworthy models can be less accurate (Ribeiro et al. 2016). Following the publication of the contest results (Shermis 2014), Perelman (2013, 2014) warned against swift conclusions that AES could perform better than human graders simply because it surpassed the level of agreement among human graders. For example, Perelman (2013, 2014) illustrated how one could easily mislead an AES system by submitting meaningless text with a sufficiently large number of words.

The ASAP study design had several pitfalls. For example, none of the essay datasets had an articulated writing construct (Perelman 2013, 2014; Kumar et al. 2017) and only essays in datasets 1, 2, 7, and 8 truly tested the writing ability of students. Datasets 1, 2, and 8 had a mean number of words greater than 350 words, barely approaching typical lengths of high-school essays. Finally, only datasets 7 and 8 were hand-graded according to a set of four rubrics.

The eighth essay dataset (D8) stood out from the others because 1) it did not suffer from a bias in the way holistic scores were resolved (Perelman 2013, 2014; Kumar et al. 2017), 2) it had the highest mean number of words (622), reflecting a more realistic essay length, 3) its holistic scores had the largest scoring scale, computed out of a set of rubric scores (see Table 1), and 4) it had one of the lowest AES mean quadratic weighted kappa values (0.67). Accordingly, D8 seemed both challenging and promising, both for machine learning and for providing formative feedback to students and teachers. However, a previous study (Boulanger and Kumar 2019) showed that D8 was insufficient to train an accurate and generalizable AES model using feature-based deep learning, because it had both an unbalanced distribution of holistic scores (high-quality essays were clearly under-represented) and a very small sample size (several holistic and rubric scores did not have enough samples to learn from). After the ASAP contest, only the labeled (holistic/rubric scores) training set was made available to the public; the labels of the validation and testing sets were no longer accessible. Thus, the essay sample totals currently available per dataset are smaller than the numbers listed in Table 1; only 722 essays of D8 were available to train an AES model. These limitations served as a key motivation for this study to target the seventh dataset (D7), which contained 1567 essay samples, despite a mean essay length of only about 171 words (roughly one paragraph). D7 was the only other available dataset whose essays were graded following a grid of scoring rubrics. D7’s holistic scoring scale was 0–30 compared to D8’s 10–60, and D7’s rubric scoring scales were 0–3 compared to D8’s 1–6.

Table 2 (Shermis 2014) shows the level of agreement between the two human graders’ ratings and the resolved scores for all eight datasets. Each essay was scored by two human graders except for the second dataset, where the final score was decided by only one human grader. For D7, the resolved rubric scores were computed by adding the human raters’ rubric scores. Hence, each human rater gave a score between 0 and 3 for each rubric (Ideas, Organization, Style, and Conventions; see Table 3). Subsequently, the two scores were added together, yielding a rubric score between 0 and 6. Finally, the holistic score was determined according to the following formula: HS = R1 + R2 + R3 + (2 ∗ R4), for a score ranging from 0 to 30. All agreement levels are calculated using the quadratic weighted kappa (QWK). For each essay dataset, the mean quadratic weighted kappa value (AES mean) of the commercial vendors in 2012 is also reported.
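
To make the scoring scheme concrete, the following minimal Python sketch reproduces the arithmetic described above; the function and variable names are illustrative and not taken from the ASAP data files.

    # Minimal sketch of D7's score resolution as described above.
    # Each rater scores the four rubrics (Ideas, Organization, Style, Conventions)
    # on a 0-3 scale; resolved rubric scores are the sums of the two ratings.

    def resolve_rubrics(rater_a, rater_b):
        """Add the two raters' scores per rubric, yielding resolved scores on a 0-6 scale."""
        return [a + b for a, b in zip(rater_a, rater_b)]

    def holistic_score(resolved):
        """HS = R1 + R2 + R3 + (2 * R4), yielding a holistic score between 0 and 30."""
        r1, r2, r3, r4 = resolved
        return r1 + r2 + r3 + 2 * r4

    # Example: both raters give 3 on every rubric -> resolved scores of 6 each and HS = 30.
    assert holistic_score(resolve_rubrics([3, 3, 3, 3], [3, 3, 3, 3])) == 30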

Table 2 Agreement levels (QWK) among human raters and resolved scores for each of the eight essay datasets; average performance (QWK) per essay dataset of commercial vendors participating in the 2012 ASAP contest
Table 3 Rubric guidelines provided to the human markers of ASAP’s D7 writing assessment

D7’s writing assessment, intended for Grade-7 students, was of persuasive/narrative/expository type, and had the following prompt:

Write about patience. Being patient means that you are understanding and tolerant. A patient person experiences difficulties without complaining.

Do only one of the following: write a story about a time when you were patient OR write a story about a time when someone you know was patient OR write a story in your own way about patience.

Table 3 describes the rubric guidelines that were provided to the two human raters who graded each of the 1567 essays made available in the training set.

Related Work

This section provides a detailed analysis of recent advances in automated essay scoring by examining AES systems trained on ASAP’s datasets. Most published studies measured and reported performance as the level of agreement between the machine and human graders, expressed both as the quadratic weighted kappa on ASAP’s D7 and as the average agreement level on all eight datasets. Table 13 (see Appendix 1; due to the size of some tables in this article, they have been moved to appendices so they do not interrupt its flow) compares the various methods and parameters used to achieve the reported performances.

One of the most relevant research projects involved experimenting with an AES system based on string kernels (i.e., the histogram intersection string kernel), ν-Support Vector Regression (ν-SVR), and word embeddings (i.e., bag of super-word embeddings) (Cozma et al. 2018). String kernels measure the similarity between strings by counting the number of common character n-grams. The AES models were trained on the ASAP essay datasets and tested both with and without transfer learning across essay datasets. Transfer learning stores knowledge learned in one task and applies it to another (similar) task in which labeled data is not abundant. Accordingly, the knowledge from the former task becomes the starting point for the model in the latter task. The outcomes of this experiment are reported in Table 13.

A second, highly relevant study, using ASAP datasets, investigated how transfer learning could alleviate the need for big prompt-specific training datasets (Cummins et al. 2016). The proposed AES model consisted of both an essay rank prediction model and a holistic score prediction model. The ranking model was trained on pairs of essays, generating a difference vector for each pair and predicting which of the two essays had higher quality. Subsequently, a simple linear regression modeled the holistic scores using the ranking data. The process reduced the data requirements of AES systems and improved the performance of the proposed approach, which proved to be competitive.

Thirdly, the notable research by Mesgar and Strube (2018) effectively exhibited how deep learning could help craft complex writing indices, such as a neural local coherence model. Their architecture consisted of a convolutional neural network (CNN) layer on top of a long short-term memory (LSTM) recurrent neural network (RNN). It leveraged word embeddings to derive sentence embeddings, which were fed into the coherence model. The coherence model was designed to analyze the semantic flow between adjacent sentences in a text. A vector – which consisted of the LSTM weights at a specific point in the sequence – modeled the evolving state of the semantics of a sentence at every word. The two most similar states in each pair of adjacent sentences were used to assess the coherence between them and were given a value between 0 and 1, inclusive, where 1 indicated no semantic change and 0 a major change. The CNN layer extracted patterns of semantic change relevant to the final writing task.
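
As an illustration only (this is not Mesgar and Strube’s implementation), the core similarity computation can be sketched in a few lines of NumPy: given the per-word LSTM states of two adjacent sentences, the pair of most similar states yields the coherence value described above.

    import numpy as np

    def adjacent_sentence_coherence(states_a, states_b):
        """Return the highest cosine similarity between any pair of per-word LSTM
        states taken from two adjacent sentences (arrays of shape (num_words, hidden_size)).
        Values near 1 suggest little semantic change, values near 0 a major change.
        Clipping to [0, 1] is an assumption of this sketch; the mapping used by the
        original model may differ."""
        a = states_a / np.linalg.norm(states_a, axis=1, keepdims=True)
        b = states_b / np.linalg.norm(states_b, axis=1, keepdims=True)
        return float(np.clip(np.max(a @ b.T), 0.0, 1.0))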

A two-stage AES model combining feature-based learning and raw-text-based learning was tested (Liu et al. 2019) and was found to be able to detect adversarial samples (i.e., essays with permuted sentences, prompt-irrelevant essays). The literature (Perelman 2013, 2014) identifies such samples as a major weakness of AES. In the first stage, three distinct LSTM recurrent neural networks were employed to a) assess the semantics of a text independently of the essay prompt (e.g., through sentence embeddings), b) estimate coherence scores (to detect permuted paragraphs), and c) estimate prompt-relevant scores (to detect whether an essay complies with prompt requirements). These three scores, along with spelling and grammatical features, were input into the second learning stage to predict the final score of the essay.

Another study examined the data constraints related to the deployment of a large-scale AES system (Dronen et al. 2015). Three optimal design algorithms were tested: Fedorov exchange with D-optimality, Kennard-Stone, and K-means. Each optimal design algorithm recommended which student-written essays should be scored by a human or machine (a noteworthy example of the separation of duties among human and AI agents (Abbass 2019)). However, a few hundred essays were required to bootstrap these optimal design algorithms. The goal was to minimize the teacher’s workload while maximizing the information obtained from the human grader to improve accuracy. The three optimal design algorithms were evaluated using ASAP’s eight datasets. Each essay was transformed into a 28-feature vector based on mechanics, grammar, lexical sophistication, and style, extracted by the Intelligent Essay Assessor. The AES system also leveraged a regularized regression model (Ridge regression) to predict essays’ holistic scores. The Fedorov exchange algorithm with D-optimality delivered the best results: for certain datasets, training a model with 30–50 carefully selected essays “yielded approximately the same performance as a model trained with hundreds of essays” (Dronen et al. 2015). Results were reported in terms of Pearson correlation coefficients between the machine and human scores. However, correlation coefficients were not provided for all ASAP essay datasets; for instance, the correlation coefficient for D7 was not included in the report.

A feature-based AES system called SAGE was designed and tested using several machine learning architectures such as linear regression, regression trees, neural networks, random forests, and extremely randomized trees (Zupanc and Bosnić 2017). SAGE was unique in that it incorporated, for the first time, 29 semantic coherence and 3 consistency metrics, in addition to 72 linguistic and content metrics. Interestingly, SAGE appears to have been tested using the original labeled ASAP testing sets made available during the 2012 ASAP contest. Unfortunately, those testing sets are no longer available. Hence, most research on AES simply reports performance on the training sets, which may prevent a fair comparison of performance and technique against the results of Zupanc and Bosnić (2017). Nevertheless, SAGE distinguishes itself from other systems because it undertook a deeper analysis at the rubric level, investigating D8’s second rubric, ‘Organization’. SAGE’s capacity to predict the Organization rubric, which will be discussed later in this article, and its use of metrics related to semantic coherence are of special interest.

Automated essay scoring comprises a few highly distinct areas of exploration, and significant advances in deep learning have renewed interest in pushing the frontiers of AES. Table 13 shows that most publication years range between 2016 and 2019. The table highlights the latest research endeavors in AES, including their respective algorithms. All models were trained on ASAP’s datasets. As mentioned above, this article investigates the underpinnings of AES systems on ASAP’s seventh dataset. Accordingly, it reports both the performance of these models on that dataset and their average performance on all eight datasets. Table 13 shows that Zupanc and Bosnić (2017) reached the highest performance, i.e., a quadratic weighted kappa of 0.881 on the seventh dataset. However, note that they seemingly had access to the original labeled testing sets that were available during the 2012 ASAP competition, which should be factored into any comparison of their performance against other models.

Literature is scarce when it comes to measuring the level of agreement between the machine and the human graders at the rubric score level (Jankowska et al. 2018). A synthesis of rubric level comparison is presented in the Discussion section below. This research investigates the prediction of rubric scores of ASAP’s D7 rubrics by applying deep learning techniques on a vast range of writing features.

Methodology

Natural Language Processing

The essay samples (1567) were processed by the Suite of Automatic Linguistic Analysis Tools (SALAT): GAMET, SEANCE, TAACO, TAALED, TAALES, and TAASSC. Each essay was characterized by a total of 1592 writing features. This study opted to maximize the number of low-level writing features, and the optimal selection of features for the AES model was performed by a deep learning mechanism in an automated fashion. The commercial AES system called Revision Assistant, developed by Turnitin, demonstrated that automatically selected features are not less interpretable than those engineered by experts (West-Smith et al. 2018; Woods et al. 2017). The following subsections describe the individual SALAT tools and the writing indices they measure, while the Analysis subsection describes how these tools were applied.

Grammar and Mechanics Error Tool (GAMET)

GAMET is an extension of the LanguageTool (version 3.2) API that measures structural and mechanical errors. LanguageTool has been demonstrated to have high precision but low recall (e.g., poor recognition of punctuation errors). It can flag a subset of 324 spelling, style, and grammar errors (Crossley et al. 2019a) and classify them into six macrofeatures listed below:

  • Grammar: errors related to verb, noun, adjective, adverb, connector, negation, and fragment.

  • Spelling: deviations from conventional dictionary spellings of words.

  • Style: wordiness, redundancy, word choice, etc.

  • Typography: capitalization errors, missing commas and possessive apostrophes, punctuation errors, etc.

  • White space: inappropriate spacing such as unneeded space (e.g., before punctuation) or missing space.

  • Duplication: word duplications (e.g., You you have eaten this banana.).

For analysis purposes, these macrofeatures are more efficient than individual microfeatures. The literature shows that automated assessments of spelling accuracy had a higher correlation with human judgments of essay quality than assessments of grammatical accuracy, possibly because mechanical errors interfere with meaning, and because grammatical errors were only weakly associated with writing quality (correlations below 0.15) (Crossley et al. 2019a).
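
GAMET itself is not shown here, but a rough equivalent of its error counting can be sketched with the language_tool_python wrapper around LanguageTool. The grouping of LanguageTool rule categories into GAMET’s six macrofeatures is only an approximation for illustration, and the Match attribute names may vary across library versions.

    # Rough approximation of GAMET-style error counting (not GAMET itself).
    from collections import Counter
    import language_tool_python

    tool = language_tool_python.LanguageTool('en-US')

    def error_category_counts(text):
        # Each match exposes a rule category (e.g., TYPOS, GRAMMAR, STYLE), which
        # only loosely corresponds to GAMET's six macrofeatures.
        return Counter(match.category for match in tool.check(text))

    print(error_category_counts("You you have eaten this banana ."))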

Sentiment Analysis and Cognition Engine (SEANCE)

SEANCE is a sentiment analysis tool that calculates more than 3000 indices, relying on third-party dictionaries (e.g., SenticNet, EmoLex, GALC, Lasswell, VADER, General Inquirer), part-of-speech (POS) tagging, component scores (macrofeatures), and negation rules.

This study configured SEANCE to include only word vectors from the General Inquirer, which encompasses over 11,000 words organized into 17 semantic categories: semantic dimensions, pleasure, overstatements, institutions, roles, social categories, references to places and objects, communication, motivation, cognition, pronouns, assent and negation, and verb and adjective types.

Since most essays in ASAP’s D7 are not high-quality writings, which makes POS tagging less reliable, this study used only the writing indices that were independent of POS. SEANCE includes a smaller set of 20 macrofeatures that combine similar indices from the full set of indices; these were derived by conducting a principal component analysis on a movie review corpus. For more information, please consult Crossley et al. (2017).

Tool for the Automatic Analysis of Cohesion (TAACO)

TAACO (Crossley et al. 2016, 2019b) provides a set of over 150 indices related to local, global, and overall text cohesion. Texts are first lemmatized and grouped per sentence and paragraph before TAACO employs a part-of-speech tagger and synonym sets from the WordNet lexical database to compute cohesion metrics.

TAACO’s indices can be grouped into five categories: connectives, givenness, type-token ratio, lexical overlap, and semantic overlap. Lexical overlap measures the level of local and global cohesion between adjacent sentences and paragraphs. The overlap between sentences or paragraphs is estimated by considering lemmas, content word lemmas, and the lemmas of nouns and pronouns. TAACO not only counts how many sentences or paragraphs overlap, but also assesses how much they overlap. Like lexical overlap, TAACO estimates the degree of semantic overlap between sentences and paragraphs.

TAACO assesses the amount of information that can be recovered from previous sentences, called givenness, and computes counts of various types of pronouns (i.e., first/second/third person pronouns, subject pronouns, quantity pronouns). It calculates the ratio of nouns to pronouns, the numbers of definite articles and demonstratives, and the number and ratio of unique content word lemmas throughout the text. Moreover, TAACO measures the repetition of words and provides indices to measure local cohesion through connectives.
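
To make the overlap indices concrete, here is a minimal sketch, in the spirit of TAACO’s local cohesion measures but not TAACO’s own code, that estimates the overlap of content-word lemmas between adjacent sentences using spaCy.

    # Minimal sketch of adjacent-sentence content-lemma overlap (not TAACO's implementation).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def adjacent_content_lemma_overlap(text):
        doc = nlp(text)
        sentence_lemmas = [
            {t.lemma_.lower() for t in sent if t.is_alpha and not t.is_stop}
            for sent in doc.sents
        ]
        overlaps = []
        for prev, curr in zip(sentence_lemmas, sentence_lemmas[1:]):
            union = prev | curr
            overlaps.append(len(prev & curr) / len(union) if union else 0.0)
        # Average proportion of shared content lemmas between adjacent sentences;
        # TAACO computes several related sentence- and paragraph-level variants.
        return sum(overlaps) / len(overlaps) if overlaps else 0.0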

Tool for the Automatic Analysis of Lexical Diversity (TAALED)

TAALED calculates 38 indices of lexical diversity. At the basic level, TAALED counts the number of tokens, the number of unique tokens, the number of tokens that are content words, the number of unique content words, the number of tokens that are function words, and the number of unique function words (6 metrics). Subsequently, it calculates features of lexical diversity and lexical density (Johansson 2009).

Lexical diversity metrics include simple, square root, and log type-token ratios (TTR) calculated on the sets of all words, content words, and function words (9 metrics). Lexical density metrics calculate the percentage of content words and the ratio of the number of unique content words over the number of unique tokens (2 metrics).

More complex variants of TTR are provided by TAALED, such as the Maas index, which linearizes the TTR curve using a log transformation (Fergadiotis et al. 2015) (3 metrics); the mean segmental TTR with 50 segments (MSTTR50) (3 metrics); and the more effective moving-average TTR with a window size of 50 (MATTR50) (Covington and McFall 2010) (3 metrics). These variants are all computed in relation to the sets of all words, content words, and function words.

Still more advanced metrics include the hypergeometric distribution’s D index (HD-D 42), which calculates the probability of drawing from the text a certain number of tokens of a particular type from a random sample of 42 tokens (McCarthy and Jarvis 2010; Torruella and Capsada 2013) (3 metrics).

Finally, TAALED’s features include the original measure of textual lexical diversity (MTLD), which “is calculated as the mean length of sequential word strings in a text that maintain a given TTR value” (McCarthy and Jarvis 2010), along with two of its variants, the bidirectional moving average (MTLD-MA-BI) and the wrapping moving average (MTLD-MA-Wrap) (9 metrics).
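
For illustration, a few of the lexical diversity indices discussed above can be sketched in plain Python; this is not TAALED’s code, and the window size of 50 simply follows the MATTR50 definition.

    import math

    def simple_ttr(tokens):
        return len(set(tokens)) / len(tokens)

    def root_ttr(tokens):
        # Number of types divided by the square root of the number of tokens.
        return len(set(tokens)) / math.sqrt(len(tokens))

    def log_ttr(tokens):
        return math.log(len(set(tokens))) / math.log(len(tokens))

    def mattr(tokens, window=50):
        # Moving-average TTR (MATTR50): average of the TTRs of every 50-token window.
        if len(tokens) < window:
            return simple_ttr(tokens)
        windows = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
        return sum(simple_ttr(w) for w in windows) / len(windows)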

Tool for the Automatic Analysis of Lexical Sophistication (TAALES)

TAALES (Kyle et al. 2018) measures over 400 indices of lexical sophistication related to word and n-gram frequency and range, academic language, psycholinguistic word information, n-gram strength of association, contextual distinctiveness, word recognition norms, semantic network, and word neighbors. Several of these metrics are normed such as word and n-gram frequency and range metrics, which are measured according to the number of word or n-gram occurrences found in large corpora of English writings (i.e., Corpus of Contemporary American English (COCA), British National Corpus (BNC), and Hyperspace Analogue to Language (HAL) corpus) and frequency lists (i.e., Brown, Kucera-Francis, SUBTLEXus, and Thorndike-Lorge). These 268 frequency and range metrics are calculated according to five domains of literature: academic, fiction, magazine, news, and spoken. These metrics allow one to measure the number of times a word or n-gram occurs in a corpus and the number of texts in which it is found.

Fifteen academic language metrics measure the proportions of words or phrases in a text that are frequently found in academic contexts but are less generally used in mainstream language. Using the MRC database (Coltheart 1981), psycholinguistic word information (14 metrics) gauges concreteness, familiarity, meaningfulness, and age of acquisition observed in the text.

Further, age of exposure/acquisition values (7 metrics) are derived from the set of words in the Touchstone Applied Science Associates (TASA) corpus, which consists of 13 grade-level textbooks from the USA. This makes it possible to measure the complexity of the words employed within a text and their links to semantic concepts as found in larger corpora.

Word recognition norms (8 metrics) make it possible to estimate the difficulty of processing a given word, such as the time it takes a person to recognize that a specific word is an English word and the time it takes to read the word aloud. These word recognition scores were calculated on a bank of 40,481 real words from the English Lexicon Project, which includes the response latencies, standard deviations, and accuracies of 816 native English speakers on lexical-decision and word-naming tasks.

Word neighborhood indices (14 metrics) report the similarity of a word in a text to other similar orthographic (words that are formed by changing just one letter), phonographic (words that differ by one letter and one phoneme), and phonological words (words that differ by only one phoneme).

TAALES includes 8 metrics related to contextual distinctiveness, based on the diversity of contexts in which a word occurs. It evaluates how much the words in a text are contextually distinct using free association norms and corpus-driven statistical approaches based on the Edinburgh Associative Thesaurus and the University of South Florida norms.

TAALES provides information (14 metrics), using the WordNet lexical database, on the polysemy and hypernymy semantic networks of a word, making it possible to measure the number of related senses and the number of superordinate terms that the word has.

TAALES assesses the strength of association within n-grams by computing the conditional probability that the words in bigrams and trigrams in a specific text will occur together based on the n-gram frequency norms derived from large corpora (75 metrics).
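
As a simplified illustration of the association-strength idea (not TAALES’s implementation), the conditional probability of a bigram can be estimated from reference-corpus counts and averaged over a text; the counts below are hypothetical stand-ins for the COCA/BNC frequency norms.

    from collections import Counter

    # Hypothetical reference-corpus counts (TAALES uses norms derived from COCA/BNC).
    unigram_counts = Counter({"strong": 1200, "tea": 800, "coffee": 900})
    bigram_counts = Counter({("strong", "tea"): 60, ("strong", "coffee"): 15})

    def conditional_probability(w1, w2):
        """P(w2 | w1) estimated from the reference counts."""
        return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

    def mean_bigram_association(tokens):
        """Average conditional probability over all bigrams of a text."""
        pairs = list(zip(tokens, tokens[1:]))
        return sum(conditional_probability(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0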

Tool for the Automatic Assessment of Syntactic Sophistication and Complexity (TAASSC)

TAASSC (Kyle 2016) quantitatively evaluates the syntactic sophistication and complexity of English writing. It calculates 367 indices, grouped into four categories: 14 Syntactic Complexity Analysis (SCA) indices (Lu 2010), 31 fine-grained indices on clausal complexity, 132 indices related to fine-grained phrasal complexity, and 190 syntactic sophistication indices.

Basically, the 14 SCA indices are derived from the counts, ratios, and mean lengths of the following syntactic structures within a text: words, verb phrases, complex nominals, coordinate phrases, clauses, dependent clauses, T-units, complex T-units, and sentences. Table 4 provides some definitions of these structures.

Table 4 Syntactic structures counted by Syntactic Complexity Analysis (SCA) (Kyle 2016; Lu 2010)

TAASSC enhances the previous SCA set of indices by adding 31 new clausal complexity indices, obtained by calculating the average number of every type of structure per clause. In other words, TAASSC measures the length of clauses as the number of direct dependents rather than words; it also provides separate counts of each type of structure instead of combining them all; and, finally, it considers clauses as being both finite and non-finite. See (Kyle 2016) for the complete list of clausal dependent types.
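
As a rough illustration of the “direct dependents” idea (this is not TAASSC’s implementation and does not reproduce its clause segmentation rules), the average number of direct dependents per verbal head can be computed with spaCy’s dependency parse.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def mean_dependents_per_verbal_head(text):
        # Treat each verb as a clause head and count its direct dependents,
        # approximating clause "length" as dependents rather than words.
        doc = nlp(text)
        heads = [token for token in doc if token.pos_ == "VERB"]
        if not heads:
            return 0.0
        return sum(len(list(token.children)) for token in heads) / len(heads)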

TAASSC computes 132 indices of phrasal complexity by identifying seven types of noun phrases and 10 types of phrasal dependents.

Finally, TAASSC supplies 15 basic indices related to syntactic sophistication, grounded in usage-based (empirical) theories of language acquisition and computed from reference corpora such as the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). Each index has 38 variants derived from the five COCA subcorpora (academic, fiction, magazine, newspaper, and spoken), for a total of 190 indices.

How the Tools Were Applied

This article revises the methodology adopted in previous work (Boulanger and Kumar 2019) to predict rubric scores using feature-based deep learning (a multi-layer perceptron (MLP) neural network) and proposes a way to expose the reasoning that typically remains hidden behind the deep learning algorithm. As mentioned in the Background section, Boulanger and Kumar trained an AES model that scored D8’s essays with a feature-based approach, rather than training a sequence model such as an LSTM (long short-term memory) recurrent neural network on the raw text data. In addition to the inadequate sample size of D8, their study had one major limitation: feature selection. The ninety-six (96) writing features that correlated most strongly with the holistic scores were selected to predict all four rubrics (Ideas and Content, Organization, Sentence Fluency, Conventions) of the D8 writing assessment. In other words, unique writing features should have been identified for each of the four rubrics instead of re-using the same feature set for all four of them. The present study also assesses feature importance per rubric.

The first step in the analysis was to look into D7’s distributions of holistic and rubric scores and evaluate the number of essay samples per rubric score. The analysis continues by quantitatively measuring the interrelationships between the underlying rubrics. Next, the analysis establishes a baseline naïve predictor to precisely measure and compare the performance of various deep neural network architectures.

The second step designs and applies a feature selection process for each of the four rubrics. There are three types of feature selection methods: filter (statistical methods determine which features should be pruned), wrapper (an external predictive model is trained to evaluate the impact of different combinations of features on model accuracy), and embedded (feature selection occurs as the model is trained, through regularization methods). This study applies both the filter and embedded methods. First, feature data were normalized, and features with variances lower than 0.01 were pruned. Second, for any pair of features having an absolute Pearson correlation coefficient greater than 0.9, the last feature (the one that comes last in terms of the column ordering in the datasets) was pruned. This second operation was performed individually on the feature set of each SALAT tool to ensure fair representation among grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. Third, any feature that directly counted the number of words or tokens in the essays was also pruned to reduce the rubric scoring models’ dependence on the very influential counts of words (Perelman 2013, 2014). After the application of these filter methods, the number of features was reduced from 1592 to 397. All four rubric models fed upon these 397 features. Fourth and finally, the Lasso (Fonti and Belitser 2017) and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied as part of the rubric scoring models’ training. Lasso is responsible for pruning features further, while Ridge regression is entrusted with mitigating multicollinearity among features. This final phase of feature selection allowed for customized feature selection per rubric.
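
A minimal sketch of the two filter steps (variance and correlation pruning) is given below, using pandas and scikit-learn. The min-max normalization is an assumption (the exact normalization is not specified above), the DataFrame name is a placeholder, and the per-tool grouping of the correlation step is omitted for brevity; the embedded step (Lasso and Ridge, i.e., ElasticNet) is applied later, during model training, for instance as L1/L2 regularizers on the network’s layers.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    def filter_features(features: pd.DataFrame,
                        var_threshold: float = 0.01,
                        corr_threshold: float = 0.9) -> pd.DataFrame:
        # 1) Normalize, then drop near-constant features (variance < 0.01).
        scaled = pd.DataFrame(MinMaxScaler().fit_transform(features),
                              columns=features.columns)
        kept = scaled.loc[:, scaled.var() >= var_threshold]
        # 2) For each pair of features with |Pearson r| > 0.9, drop the one that
        #    comes last in column order (run per SALAT tool in the study).
        corr = kept.corr().abs()
        columns = list(kept.columns)
        to_drop = set()
        for i, first in enumerate(columns):
            for second in columns[i + 1:]:
                if corr.loc[first, second] > corr_threshold:
                    to_drop.add(second)
        return kept.drop(columns=sorted(to_drop))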

In the third step, six deep learning (MLP) architectures were tested. Seven of the most important hyperparameters were tuned to find a quasi-optimal combination: 1) activation function (selu, elu, relu, tanh, sigmoid, exponential), 2) optimizer (Adam, SGD, Adadelta, Adamax, Nadam), 3) L1 penalty (10 equally distributed values between 0.0035 and 0.0125), 4) L2 penalty (10 equally distributed values between 0.0035 and 0.0125), 5) number of hidden layers (2, 3, 4, 5), 6) number of neurons in the first hidden layer (128, 256), and 7) number of neurons in the last hidden layer (16, 32). This hyperparameter space encompasses 48,000 different combinations. A randomized search consisting of 200 randomly sampled hyperparameter combinations was run for each of the six architectures. Table 5 delineates the final hyperparameter values; only three base architectures are shown in Table 5, each of them also being reused as part of a bagging ensemble technique. Interestingly, all three architectures have only two hidden layers, below the recommended threshold of three to be considered “deep” learning (Rosebrock 2017). More hyperparameters, larger ranges and finer-grained hyperparameter values, and a larger number of hyperparameter combinations should have been tested to learn better model parameters. The smaller sample size and hyperparameter space are a limitation of this study, due to a lack of high-performance computing (HPC) resources. As a follow-up study, we plan to re-design and re-run the analysis once HPC resources are secured.
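
The sketch below outlines one candidate MLP and a single random draw from the hyperparameter space listed above, using tf.keras; the interpolation of intermediate layer sizes is an assumption, since only the first and last hidden layers’ sizes were tuned in the study.

    import random
    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    def build_mlp(n_features=397, n_classes=7, activation="relu", optimizer="adam",
                  l1=0.0035, l2=0.0035, hidden_layers=2, first_units=256, last_units=32):
        reg = regularizers.l1_l2(l1=l1, l2=l2)   # ElasticNet-style embedded feature selection
        model = tf.keras.Sequential([tf.keras.Input(shape=(n_features,))])
        units = first_units
        for _ in range(hidden_layers - 1):
            model.add(layers.Dense(units, activation=activation, kernel_regularizer=reg))
            units = max(units // 2, last_units)  # intermediate layer sizes are an assumption
        model.add(layers.Dense(last_units, activation=activation, kernel_regularizer=reg))
        model.add(layers.Dense(n_classes, activation="softmax"))
        model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    def sample_hyperparameters():
        # One random draw from the 48,000-combination space described above.
        return dict(
            activation=random.choice(["selu", "elu", "relu", "tanh", "sigmoid", "exponential"]),
            optimizer=random.choice(["adam", "sgd", "adadelta", "adamax", "nadam"]),
            l1=random.choice([round(0.0035 + i * 0.001, 4) for i in range(10)]),
            l2=random.choice([round(0.0035 + i * 0.001, 4) for i in range(10)]),
            hidden_layers=random.choice([2, 3, 4, 5]),
            first_units=random.choice([128, 256]),
            last_units=random.choice([16, 32]),
        )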

Table 5 Optimal hyperparameter values by deep learning architecture

Finally, the last step estimates feature importance for each of the trained rubric scoring models. A technique called permutation importance (Breiman 2001) randomizes the values of a feature in the testing set multiple times, one feature at a time, measures the average change (called the weight) in the model’s predictive power (e.g., percentage of accurate predictions), and ranks features by weight in descending order. Rubric score predictions are presented in confusion matrices, and corresponding precision, recall, and F1-score values are reported. However, only one set of hyperparameter values has been optimized per deep learning architecture. Hence, all four rubric scoring models pertaining to each architecture have been trained with the same set of hyperparameter values. This is another limitation that we aim to address in the follow-up study.
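
A minimal sketch of the permutation importance procedure follows, assuming a trained classifier whose predict method returns class probabilities (e.g., a Keras model) and NumPy arrays for the testing data; the weight of a feature is its average drop in exact-match accuracy over five permutations.

    import numpy as np

    def permutation_importance(model, X_test, y_test, n_repeats=5, seed=0):
        rng = np.random.default_rng(seed)
        baseline = np.mean(np.argmax(model.predict(X_test), axis=1) == y_test)
        results = {}
        for j in range(X_test.shape[1]):
            drops = []
            for _ in range(n_repeats):
                X_perm = X_test.copy()
                rng.shuffle(X_perm[:, j])   # randomize one feature, keep the others intact
                accuracy = np.mean(np.argmax(model.predict(X_perm), axis=1) == y_test)
                drops.append(baseline - accuracy)
            results[j] = (np.mean(drops), np.std(drops))   # (weight, standard deviation)
        # Rank features by average accuracy drop, in descending order of importance.
        return sorted(results.items(), key=lambda item: item[1][0], reverse=True)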

Results

This study investigated both the feasibility and benefits of applying automated essay scoring at the rubric level. Rubric scores provide high-level formative feedback that is useful to both student-writers and teachers. Most of the literature in this domain focuses almost exclusively on predicting holistic scores. This article goes one step further by analyzing the performance of deep/shallow learning on rubric score prediction and by investigating the most important writing indices that determine those rubric scores.

The first step in the process is to get to know the dataset that will train the rubric scoring models. Figure 1 shows the distribution of D7’s holistic scores. Note that the scoring scale is from 0 to 24. The distribution of holistic scores appears quite balanced from score 6 up to 24. Very few or no essays have been assigned a final score lower than 6, which is reasonable. The most frequently given scores are 16 (199 times) and 17 (160 times); other scores’ frequencies are relatively uniform (between 20 and 118), allowing an AES system to learn from high-quality, average, and low-quality essays.

Fig. 1 Distribution of holistic scores

The human raters’ distributions of scores, all rubrics combined (R1 + R2 + R3 + R4), are exhibited in Fig. 2. The figure highlights the raters’ biases. According to Table 2, their agreement level, measured by QWK, is 0.72.

Fig. 2 Distributions of human raters’ scores (all rubrics combined)

Additionally, to take advantage of this unique hand-graded set of essays, this article was motivated by the idea that holistic scores could be better predicted through the prediction of their constituent rubric scores. The rationale is simple. In the best case, if the distribution of holistic scores were uniform, there would be about 1567/25 ≈ 62 examples per score, which is not much from which to learn all the intricacies of English writing. On the other hand, the scale of rubric scores ranges from 0 to 6, implying that in the best case there would be 1567/7 ≈ 223 essays per rubric score, which means more essays from which to assess a narrower competence of English writing. However, it is beyond the scope of this single study to verify whether the prediction of holistic scores is more accurate through the prediction of their constituent rubric scores, especially given that less accurate models may well be more interpretable (Ribeiro et al. 2016) and that predictive accuracy alone is not the only criterion determining the trustworthiness of an AES system (Murdoch et al. 2019). Figure 3 exhibits the distributions of scores by rubric, while Table 6 presents their descriptive statistics.

Fig. 3 Distributions of rubric scores

Table 6 Descriptive statistics of rubric score distributions

Table 7 shows the level of agreement among rubrics, measured by both the quadratic weighted kappa and the average agreement level per rubric. Note from Fig. 3 and Table 6 that each rubric’s most frequent score is 4, suggesting that an AES system could naively predict 4 as the score of every rubric and hence a holistic score of 16 as per the formula HS = R1 + R2 + R3 + R4. Table 7 also indicates that the rubrics generally have moderate to strong levels of agreement.

Table 7 Levels of agreement among scoring rubrics (QWK)

The performance of this naïve AES system (also known as a majority classifier) is delineated in Table 8. Four metrics are used throughout this article to measure the performance of an AES model. They are: a) the quadratic weighted kappa; b) the percentage of exact predictions; c) the percentage of exact and adjacent (±1) predictions (simply denoted by “adjacent (±1)” from now on); and d) the percentage of exact, adjacent (±1), and adjacent (±2) predictions (simply denoted by “adjacent (±2)” from now on).

Table 8 Performance of a naïve AES system

The quadratic weighted kappa measures the level of agreement between two raters by controlling for random guessing and by heavily penalizing higher distances (squared) between pairs of ratings. The weighted kappa normalizes weight assignment based on the agreement scale and ranges between 0 and 1. For example, holistic scores are predicted on a 0–24 scale, while rubric scores lie between 0 and 6. Hence, the penalty (weight) assigned to a predicted holistic score that is ‘off by 2’ will be less than the penalty assigned to a predicted rubric score that is ‘off by 2’.
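
For reference, the standard definition of the quadratic weighted kappa, consistent with the behaviour described above, is:

    \kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}, \qquad w_{ij} = \frac{(i - j)^2}{(N - 1)^2}

where O is the observed matrix of rating pairs, E is the matrix of pairs expected under chance agreement, N is the number of score categories, and i and j index the two raters’ scores.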

The percentage of exact matches measures the accuracy of the AES model as the proportion of predictions that exactly match the expected value; all other predictions are counted as wrong, no matter their distance from the expected value. Percentages of adjacent matches indicate the proportions of predictions that fall within a certain distance of the expected value. This study considers distances of one and two.
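
The four metrics can be computed as in the following sketch, assuming integer-valued true and predicted scores (e.g., rubric scores on the 0–6 scale); scikit-learn’s cohen_kappa_score with quadratic weights implements the QWK.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def evaluate(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        distance = np.abs(y_true - y_pred)
        return {
            "qwk": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
            "exact": np.mean(distance == 0),
            "adjacent_1": np.mean(distance <= 1),   # exact or off by one
            "adjacent_2": np.mean(distance <= 2),   # exact or off by at most two
        }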

Both supervised classification and regression techniques have been leveraged to model rubric scores. Table 9 reports the performance of six distinct deep learning MLP architectures that are trained, validated, and tested on ASAP’s seventh dataset. Each essay was converted to a vector of 397 writing features. The 1567 essays have been randomly split into a training and testing set, that is, 80% of the essays formed the training set (1254), while the remaining 20% formed the testing set (313). Although validation results are not reported in this article, 15% of the training set was used as a validation set to drive model training toward better accuracy (except in ensemble-based models). A 5-fold cross-validation was performed for each ensemble-based architecture, implying that the training set was randomly split into a smaller training dataset (80% = 1003 essays) and a validation set (20% = 251 essays). Following recommendations from the literature (Boulanger and Kumar 2019; Cozma et al. 2018; Taghipour and Ng 2016), model performance was reported on the testing set as an average of several testing iterations (various samplings) instead of choosing the “highest kappa” produced, to avoid overfitting the AES models to the testing set. Hence, each architecture has been trained and evaluated five times and the resulting performance measurements were averaged. Table 9 reports the average performance of each architecture for each rubric.

Table 9 Rubric scoring models’ performance on training/testing sets

The first architecture, called “Classification”, selects the most likely rubric score among a set of seven discrete scores between 0 and 6. The second architecture, called “Classification Ensemble”, leverages a bagging ensemble technique. Essentially, each model trained per fold during cross-validation constitutes a machine grader. Thus, five machine graders with distinct “expertise” determine the predicted score by averaging their assessed scores (alternatively, the prediction could be determined by vote, selecting the most frequent score (mode), or by taking the median score). The third architecture, “Regression”, is similar to “Classification”; however, instead of selecting the most likely score on a scale of seven discrete scores, it computes a single real-number score (e.g., 4.3333) and rounds it to the nearest integer, clipping it to 0 or 6 if the real number falls outside the scale. The fourth architecture is identical to “Classification Ensemble”, except that it employs regression instead of classification. The fifth architecture, “Multiple Regression”, considers the interdependencies among rubrics (see Table 7) that underlie the determination of the holistic score. Thus, instead of predicting a single rubric score in a siloed manner, this approach predicts all four rubric scores at once. Subsequently, the agreement level between the machine and the human graders is analyzed and reported per rubric. Finally, the sixth architecture, “Multiple Regression Ensemble”, employs an ensemble technique on top of the fifth architecture to predict rubric scores.
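
A minimal sketch of the “Multiple Regression” idea follows: a single network with four real-valued outputs, one per rubric, so that shared structure among the rubrics can be learned. The layer sizes and loss shown here are illustrative, not the tuned values reported in Table 5.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_multiple_regression_mlp(n_features=397, n_rubrics=4):
        inputs = tf.keras.Input(shape=(n_features,))
        x = layers.Dense(256, activation="relu")(inputs)
        x = layers.Dense(32, activation="relu")(x)
        outputs = layers.Dense(n_rubrics)(x)   # one real-valued score per rubric
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="mse")
        return model

    # Predicted real-valued scores are rounded to the nearest integer and clipped
    # to the 0-6 rubric scale before agreement with the resolved scores is measured.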

Given that on average the classifiers have higher accuracy in terms of percentage of exact matches (the most important metric), their performance was investigated further. Figure 4 shows the normalized confusion matrix of each rubric along with the precision, recall, and F1-score of each rubric score. Note that precision is, for a given rubric score, the proportion of essays to which the machine assigned that score that were also given that score by the human graders. Conversely, recall is the proportion of essays that were assigned a specific rubric score by the humans and that were also correctly predicted by the machine. The F1-score is the harmonic mean of precision and recall. Note that precision and recall do not consider how far a prediction is off the expected value (the QWK does). For example, Fig. 4 demonstrates that 25% of all essays were given a 4 for the Ideas rubric by both the human graders and the machine grader (accurate predictions). By adding up the ratios along the diagonal, it is possible to calculate the percentage of exact matches. Hence, for the Ideas rubric, the percentage of exact matches is 0.05 + 0.06 + 0.25 + 0.04 + 0.05 = 0.45 (45%); for the Organization rubric, it is 0.08 + 0.02 + 0.23 + 0.05 + 0.07 = 0.45 (45%); for the Style rubric, 0.05 + 0.04 + 0.37 + 0.04 + 0.05 = 0.55 (55%); and for the Conventions rubric, 0.05 + 0.03 + 0.21 + 0.11 + 0.10 = 0.50 (50%). Note that these are close to the exact-match percentages in Table 9 under the Classification heading (Ideas: 45.1%, Organization: 47.1%, Style: 54.3%, Conventions: 47.8%). They are slightly different because the confusion matrices were derived from a single classifier per rubric instead of being averaged (over five iterations). Similarly, adjacent matches (±1) are calculated by adding the ratios along the diagonal plus the ratios directly above and below each diagonal element.

Fig. 4 Normalized confusion matrices and classification reports for all rubrics (classifier). Human and machine scores are represented by the vertical and horizontal axes, respectively

Figure 5 shows the normalized confusion matrix of each rubric between the two human graders for comparison with those of the machine grader. Notice how the rubric score scales vary between Figs. 4 and 5. The machine grader predicts rubric scores on a 0–6 scale, the scale of resolved rubric scores derived by adding the two human raters’ rubric scores which were originally on a 0–3 scale. Thus, the resolved scores are used to benchmark the performance of the machine marker, as depicted in Fig. 4. Once the performance of the machine marker is measured, it is important to compare it against human performance to determine whether the AES system can minimally meet human expectations. Hence, measuring human performance implies assessing the level of agreement between human raters, which in this case can only be measured by comparing the two human raters’ rubric scores on the 0–3 scale, creating a scale discrepancy between machine and human performances.

Fig. 5 Confusion matrices for all rubrics between the two human raters. Rater 1 and Rater 2 are represented by the vertical and horizontal axes, respectively

Finally, to assess the role that each writing feature played in the essay scoring per rubric, a method called permutation importance was run. It estimates the impact that randomizing the values of a single feature in the testing set has on the accuracy of the trained model (classifier), while keeping all other feature values intact. Five randomized permutations are performed and the average impact on the percentage of exact matches along with the standard deviation are reported for the selected feature. The process is repeated for all writing features selected for training. Finally, the writing features are listed in order of importance. Tables 14 and 15 (see Appendix 2) delineate the 20 most important writing features by rubric. For example, the ‘you_gi’ feature (Rank 1) in Table 14 has a weight of 2.10% and a standard deviation of 0.25% (the weight will tend to vary by 0.25% depending on the selected permutation). This means that on average, the accuracy of the rubric scoring model in terms of exact matches drops by 2.10% when this piece of information is unknown. According to Table 9, the accuracy of the ‘Classification’ architecture for Ideas Rubric is 45.1% (exact matches). Thus, dropping the feature will result in a model with a lower accuracy, that is, approximately 43.0%.

The next table (Table 16; see Appendix 2) shows by order of importance, all rubrics combined, the writing features that influenced the determination of essays’ holistic scores the most. The ‘Avg. Weight’ column indicates by how much each writing feature on average improves each rubric scoring model’s accuracy. The ‘Avg. Std. Dev.’ column is the average level of uncertainty (standard deviation) about the improvement that each feature brings to the four rubric scoring models. Table 16 identifies which writing features are of global importance to all four rubrics. By comparing every rubric’s list of most important features (Tables 14 and 15) against Table 16, it is also possible to determine which features are of local importance to a specific rubric.

Discussion: Performance of Linguistic Indices-Based Deep Learning

The four rubric score distributions in Fig. 3 look fairly similar to each other, with their means ranging from 3.68 to 4.34 and their standard deviations from 1.10 to 1.62. They all show that the most frequent score for every rubric is 4. Table 7 shows that the rubric scores are moderately to strongly dependent on each other. The quadratic weighted kappa values all range from 0.46 to 0.79. Ideas rubric scores on average agree at 0.62 with the other rubrics, the Organization rubric on average agrees at 0.73, the Style rubric at 0.69, and the Conventions rubric has the lowest average agreement level at 0.60. The following adjacent pairs of rubrics have strong levels of agreement: Ideas-Organization, Organization-Style, and Style-Conventions. In contrast, the non-adjacent Ideas-Conventions pair shows the weakest agreement. This implies a partial overlap but also a progression among the rubrics.

Table 8 reveals from the distributions of rubric scores that 34.5% of scores on the Ideas rubric are a 4, 33.1% on the Organization rubric, 46.3% on the Style rubric, and 33.6% on the Conventions rubric. Thus, systematically assigning a 4 to every rubric would “predict” accurate scores 36.9% of the time on average. Similarly, assigning a 4 to every rubric of every essay would result in 65.5% adjacent (±1) matches for the Ideas rubric, 69.8% for the Organization rubric, 79.7% for the Style rubric, and 70.3% for the Conventions rubric, an average of 71.3%. This is a deceptively strong performance for a completely naïve AES model. Interestingly, the quadratic weighted kappa proves to be an effective indicator of such degenerate prediction behavior, as shown by the zeroes in every rubric. The fact that every predicted score is always 4 on the 0–6 scale, while the resolved scores vary significantly on the same scale, exhibits a gaming behavior that the QWK formula can detect. To be relevant, the performance of an autonomous (or a human-in-the-loop) AES system should significantly exceed that of this baseline model.
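
The zero kappas of the baseline are easy to verify: with a constant prediction of 4, the QWK collapses to 0 regardless of the exact- and adjacent-match rates. The sketch below uses synthetic stand-in scores (uniformly random, unlike D7’s actual skewed distribution) purely to illustrate this behaviour.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 7, size=1567)    # synthetic stand-in resolved rubric scores (0-6)
    y_pred = np.full_like(y_true, 4)          # the naive majority-class prediction

    print("exact     :", np.mean(y_true == y_pred))
    print("adjacent 1:", np.mean(np.abs(y_true - y_pred) <= 1))
    print("QWK       :", cohen_kappa_score(y_true, y_pred, weights="quadratic"))  # 0, up to floating point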

By aggregating and selecting the best performance by rubric on the testing set, all architectures combined, as shown in Table 10, it can be observed that the QWKs lie between 0.69 and 0.77. The average QWK of the four rubrics is equal to the level of agreement on holistic scores between the two human raters (0.72) and close to the 2012 commercial vendors’ mean agreement level (0.76) (see Table 2). This is remarkable because 1) the models were trained, validated, and tested with a smaller essay dataset, that is, the equivalent of the original training dataset only, and 2) smaller rubric score scales tend to produce lower QWKs than larger scales such as the holistic score scale (0–24 for D7), the scale on which the performances reported in Table 2 were measured (Perelman 2013).

Table 10 Best AES performance on testing set by rubric, all deep learning architectures combined

Table 10 also presents the agreement level between the two human raters at the rubric level. Notice that the original rubric scores given by the human raters were on a 0–3 scale. It can be seen from Table 10 that the AES rubric models have a mean agreement level of 0.72 with the resolved scores, which is significantly greater than the mean agreement level between the two human raters (0.60). Nevertheless, as mentioned above, the human raters’ scoring scale is smaller than the scale of resolved scores (the prediction scale of the machine marker), which tends to produce smaller QWKs when the agreement between the human raters is measured. Since the difference in performance between the machine marker and the human markers is quite large, it is very likely that the accuracy of the trained rubric scoring models is equivalent, if not superior, to human performance, even when controlling for this scale discrepancy.

This study’s best exact match percentages range between 45.6% and 55.4%, a 9.1–19.5% improvement in comparison to the majority classifiers previously discussed. The least accurate rubric is the Ideas rubric, which intuitively requires more background knowledge than the other rubrics. On average 89–96% of the predicted rubric scores are adjacent (±1) matches, compared to 65–80% for the naïve predictor. In other words, on average 50.6% of rubric scores are exact, 42.2% are off by 1, 6.7% are off by 2, and 0.5% are off by 3 or more.

Table 11 summarizes the information from the confusion matrices in Figs. 4 and 5. On average, the human raters assigned identical rubric scores and adjacent (±1) scores 63% and 99% of the time, respectively. On the other hand, the machine marker’s predictions are on average exact, adjacent (±1), and adjacent (±2) 49%, 88%, and 99% of the time, respectively. The fact that the scale of resolved scores is almost twice as large as the human raters’ rubric score scale can explain why the machine marker’s percentages of exact matches are smaller than those of its human counterparts. This provides extra evidence that the machine marker is nearly as accurate as, if not as accurate as, the human markers. For instance, the machine marker is 45% accurate on the Ideas rubric, and 85% − 45% = 40% of its predictions are off by one. If the predicted rubric scores were rescaled to a 0–3 scale, approximately half of this 40% of predicted scores that are off by one would be rounded toward their rescaled resolved score and the other half would be rounded away from it. The percentage of exact matches would then reach approximately 45% + (0.5 × 40%) = 65%, and the percentage of rescaled adjacent (±1) scores would be close to 100%. Note that this calculation is made on the machine marker’s performance on the Ideas rubric, which, among the four rubrics, is the farthest from the human raters’ performance.

Table 11 Comparison of machine marker’s and human raters’ performances, which are derived from confusion matrices in Figs. 4 and 5

To the best of our knowledge, only one study attempted to predict rubric scores using D7 (Jankowska et al. 2018), only one study investigated rubric score prediction on D8 (Zupanc and Bosnić 2017), and very few AES systems in general predict essay scores at the rubric level (Kumar et al. 2017). Zupanc and Bosnić (2017) reported an agreement level (QWK) of 0.70 on Organization Rubric (D8). Their feature-based AES model included 29 coherence metrics, which greatly contributed to the observed performance (alone these coherence metrics achieved a QWK of 0.60).

Similarly, Jankowska et al. (2018) trained an AES system on D7 using Common N-Gram, Support Vector Machine, and Naïve Bayes classifiers. The 13 feature sets used to train the various classifiers consisted of character n-grams, with n ∈ {1, 2, 3, …, 10}, and word and stemmed word n-grams of length 1 and 2. Rubric scores were predicted on the 0–3 scale, the scale of the human raters, and two machine markers were trained per rubric, one per human rater. Table 12 reports the best agreement levels between each machine marker and its corresponding human rater and between the two human raters (Jankowska et al. 2018). It is interesting to observe that the typical agreement level on holistic scores reported in the literature (between 0.76 and 0.88; see Tables 2 and 13) did not translate into equally high agreement levels on rubric scores (between 0.428 and 0.657). The agreement levels on rubric scores between the two human raters (between 0.544 and 0.695) were also smaller than their agreement level on holistic scores (0.72; see Table 2). The exception is the Ideas rubric: the feature sets were likely more suited to the scoring of this rubric than to the other rubrics, which are more concerned with how ideas are formulated.

Table 12 The best QWKs per rubric and machine marker generated from Jankowska et al.’s best rubric scoring models

It can also be observed that this study’s models (Table 10) on average outperform results reported by Jankowska et al., although this study’s prediction scale is larger (0–6 instead of 0–3). Tables 10 and 12 reveal that the mean agreement level (QWK) between the two human raters is 0.60, averaged over all four rubrics. Jankowska et al.’s rubric scoring models at best had an average agreement level (QWK) of 0.531 with the human raters, below the human raters’ agreement level. In contrast, this study reports a mean agreement level with the human raters’ resolved scores of 0.72, well beyond the agreement level between the two human raters.

To evaluate whether the parallel prediction of the four rubrics could accurately predict and explain essays’ holistic scores, this study predicted holistic scores as the sum of the rubric score predictions. It was found that 1) the agreement level with the human raters’ resolved scores, as measured by QWK, was 0.785; 2) 17.2% of holistic score predictions were exact; 3) 46.1% were adjacent (±1); and 4) 64.0% were adjacent (±2). Remember that holistic scores range from 0 to 24 and that these performance indicators were averaged over five iterations to avoid reporting overfit performance.

The rubric-based AES system proposed in this article exhibits performance comparable to contemporary related work (see Table 13), where these cutting-edge AES tools achieve agreement levels between 0.766 and 0.811 (Cozma et al. 2018; Taghipour and Ng 2016; Wang et al. 2018). Note that Cozma et al., Taghipour and Ng, and Wang et al. are the only ones to have averaged their reported performance over multiple training runs of their AES models, a practice that tends to lower reported performance numbers.

Discussion: What are the Most Important Features per Rubric?

As previously mentioned, D7’s writing assessment, written by Grade-7 students, had an average length of approximately 171 words, was of persuasive/narrative/expository type, and required students to write a story about patience. The marking guidelines provided to the two human raters were described in Table 3. Each essay was processed by the Suite of Automatic Linguistic Analysis Tools (SALAT), which converted it into a vector of 1592 metrics. After feature selection through filter methods (i.e., pruning of low-variance and highly correlated features) and an embedded method (i.e., ElasticNet regularization), and after deleting all features that directly counted the number of words/tokens in an essay (Perelman 2013, 2014), 397 writing features were selected to train the four rubric scoring models presented in the previous section. Of the 397 selected features, 12 were generated by GAMET (grammar and mechanics), 97 by SEANCE (sentiment analysis and cognition), 76 by TAACO (cohesion), 14 by TAALED (lexical diversity), 108 by TAALES (lexical sophistication), and 90 by TAASSC (syntactic sophistication and complexity).
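A minimal sketch of such a feature selection pipeline is shown below (scikit-learn); the variance, correlation, and regularization thresholds are illustrative assumptions rather than the study’s actual settings, and the data are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# X: one row per essay, one column per SALAT metric; y: resolved rubric scores.
# Placeholder data; thresholds (0.01, 0.9, alpha) are illustrative assumptions.
X = pd.DataFrame(np.random.rand(200, 50),
                 columns=[f"metric_{i}" for i in range(50)])
y = np.random.randint(0, 7, size=200)

# Filter method 1: drop near-constant features.
X = X.loc[:, VarianceThreshold(threshold=0.01).fit(X).get_support()]

# Filter method 2: drop one feature from each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# Embedded method: keep features with non-negligible ElasticNet coefficients.
# (Features that directly count words/tokens would additionally be removed.)
selector = SelectFromModel(ElasticNet(alpha=0.01), threshold=1e-5)
selector.fit(StandardScaler().fit_transform(X), y)
selected_features = X.columns[selector.get_support()]
```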

Tables 14 and 15 list the 20 most important features per rubric based on the classification model trained for each of them. It can be noted that for Rubrics 1, 2, and 4, writing features from all six SALAT tools rank among the top 20; for the Style rubric, no linguistic indices generated by GAMET appear among the 20 most important features.

Each rubric has a distinct set of most important features and feature weights are relatively small, which leads to the hypothesis that the models do not significantly suffer from dependence on a single or a few dominant features. For example, Ideas Rubric’s most important feature (you_gi: number of 2nd-person pronouns divided by number of words) carries a weight as low as 2.10% on the model’s performance. In other words, knowledge about the usage of second-person pronouns (suggesting a direct dialogue) allowed the rubric scoring model (Ideas Rubric) to increase the accuracy of its predictions by 2.10%. Similarly, when the Organization rubric scoring model is informed of the usage of words associated with infants and adolescents (nonadlt_gi, a category of words expressing social relations), the accuracy of its predictions is improved by 1.21%. The measure of textual lexical diversity assessed on content words (mtld_original_cw) (Fergadiotis et al. 2015; McCarthy and Jarvis 2010) improved the accuracy of Style Rubric’s predictions by 3.12%. The type-token ratio of function words calculated by taking the square root of the total number of function words (root_ttr_fw) (Torruella and Capsada 2013) is responsible for 2.68% of the Conventions rubric scoring model’s accuracy. This linguistic index is an example of the indirect influence that text length has on rubric score predictions; although word-count variables have been removed from consideration, the ‘root_ttr_fw’ feature is actually dependent on text length (see Table 17 in Appendix 3).
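The feature weights reported in this section can be read as single-feature permutation importances (cf. the randomization procedure described in the limitations section): the drop in exact-match accuracy observed when the values of one feature are shuffled. A minimal sketch, assuming a fitted classifier `model` and NumPy validation arrays `X_val` and `y_val` (hypothetical names):

```python
import numpy as np

def single_feature_importance(model, X_val, y_val, seed=42):
    """Estimate each feature's weight as the drop in exact-match accuracy when its
    values are shuffled one feature at a time (X_val is a 2-D NumPy array)."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(model.predict(X_val) == y_val)
    weights = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        rng.shuffle(X_perm[:, j])            # destroy the information carried by feature j
        weights.append(baseline - np.mean(model.predict(X_perm) == y_val))
    return np.array(weights)                 # e.g. 0.021 ~ "2.10% of the model's accuracy"
```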

Table 16 lists the 20 most important features, all rubrics combined, obtained by averaging the weights and standard deviations that each feature carries across the four rubrics. It demonstrates the potential importance of each feature for the prediction of essays’ holistic scores. For instance, the most important feature, the hypergeometric distribution’s D index, which assesses the diversity of function words, has an average weight of 1.53% across all four rubrics. This implies that the accuracy of each rubric scoring model is improved by approximately 1.53% ± 0.87% (Footnote 10). Only two writing features of global importance (‘pos_gi’ and ‘rcmod_nsubj_deps_nn_struct’) are not listed in at least one of the rubrics’ lists of 20 most important features. Similarly, among the 66 linguistic indices forming the four lists of 20 most important features per rubric, only 18 are of global importance for all four rubrics. This offers a new perspective from which to analyze the importance of a feature and confirms the necessity of performing customized feature selection at the rubric level.

Ideas Rubric

Five types of words (SEANCE) are of interest to the Ideas rubric scoring model. Together, their usage improves the model’s accuracy by 2.5% to 4.9% (3.7% ± 1.2%). These types of words are a) any direct reference to another person, b) abstract nouns (e.g., ability, accuracy, action, activity, administration), c) words expressing non-work social rituals (e.g., adjournment, affair, ambush, appointment, armistice), d) words introducing a time dimension in the story (e.g., abrupt, advance, after, afternoon, afterward), and e) words expressing the idea of fetching (denoting effort) (e.g., acquire, altruistic, apprehend, bandit, benefactor). In summary, the usage or the non-usage of these types of words might be indicative of whether the essay writer tries to tell a story about a tangible situation that happened in one’s life.

From a grammar and mechanics perspective (GAMET), the rubric scoring model only notices the number of grammatical errors and the number of misspellings of English contractions, which represent between 0.3% and 2.3% of the model’s predictive power. Further verification is required to confirm whether a non-trivial presence of grammatical errors might prevent the effective transmission of ideas.

The Ideas Rubric’s machine marker is attentive to four metrics of lexical diversity (TAALED) and six metrics of lexical sophistication, which account for 3.5–6.8% and 1.6–7.2% of its predictive power, respectively. It also takes notice of the variety of function words and content words through the lens of the measure of textual lexical diversity (MTLD) and the hypergeometric distribution’s D index (see Table 17 in Appendix 3 for more information about these indices). The machine marker also considers criteria such as the strength of association within trigrams and the number of occurrences of bigrams and content words in general usage (measured with COCA magazine corpus and HAL corpus). It looks at the mean range of content words, that is, the average number of texts in the COCA fiction corpus in which a content word is included. The rubric scoring model also feeds upon the proportion of trigrams in an essay text that are among the 10,000 most frequent trigrams in the COCA fiction corpus. In addition, it looks at the average generality or specificity (number of senses) of the adverbs used.

The rubric scoring model is not sensitive to text cohesion, except for the number of sentences with any verb lemma repeated somewhere in the next two sentences (0.4–1.5% of the predictive power). Likewise, syntactic sophistication and complexity play a limited role in determining the rubric score as they account for between 0.2% and 1.2% of the accuracy of the rubric score predictions.

Overall, the 20 most important features explain 16.2% ± 7.7% of the model’s predictive power. Consequently, they are arguably responsible for the increase in the percentages of exact and adjacent (±1) predictions in comparison to the naïve predictor, improving exact matches from 34.5% to 45.6% and improving adjacent (±1) matches from 65.5% to 89.5%.

Organization Rubric

The following vocabulary elements play an important role in predicting the score of the Organization rubric: a) references to children and adults (e.g., baby, boy, child, childish, children); b) expressions of positive affect through adjectives; c) usage of positive adjectives; d) expressions of increase in quality or quantity (e.g., abound, absorbent, absorption, accelerate, acceleration); and e) allusions to physical body parts and tangible objects. Together, they explain between 1.4% and 5.9% of the model’s predictive power. It is important to note that feature importance does not indicate whether important features are desirable traits to be found within an essay (e.g., high usage of body-part words), but rather that the inclusion or exclusion of the information they provide contributes to a more accurate scoring of the Organization rubric. Nevertheless, it can be hypothesized that these categories of vocabulary are desirable since they are indicative of storytelling, a requirement of the writing assessment’s prompt.

The rubric scoring model considers the following four TAALES lexical sophistication metrics: a) the frequencies of content words in general usage as measured in the HAL corpus, b) the degree of concreteness expressed by concrete words, c) the generality or specificity of the adjectives used (average number of senses per adjective), and d) the proportion of trigrams in an essay that are among the 10,000 most frequent trigrams in the COCA fiction corpus. Their contribution to the model’s accuracy is somewhat ambiguous, ranging between −0.7% and 5.2%; the inclusion of certain features was therefore sometimes found to degrade the model’s predictive power.

The rubric scoring model uses only one punctuation-related metric from grammar and mechanics, that is, the number of times a comma is missing after a conjunctive/linking adverb at the beginning of a new sentence (e.g., however, besides, nonetheless, etc.). This writing feature contributes rather minimally to the model’s predictive power (between 0.4% and 1.0%).

Organization Rubric’s machine marker considers three writing features that measure aspects of text cohesion (0.8–2.6%): a) the types of all connectives, b) the variety of adjectives used, and c) the presence of negative connectives (e.g., admittedly, alternatively, although).

Among the most important features are linguistic indices that describe the complexity of the noun phrases and clauses used (syntactic sophistication and complexity), such as a) the number of adjectival modifiers per direct object, b) the number of modal auxiliaries per clause, c) the number of prepositions per object of the preposition, and d) the number of phrasal verb particles per clause. Together they explain between −0.1% and 4.16% of the model’s accuracy; however, their impact shows some uncertainty because the interval crosses zero.

The list of 20 most important features includes only one metric pertaining to lexical diversity. This classical type-token ratio metric (the ratio of unique words to the total number of words (Gregori-Signes and Clavel-Arroitia 2015)) has a negligible effect on the model’s performance (−0.1–1.1%).
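For concreteness, the classical and root type-token ratios can be computed as follows (a toy sketch on all tokens; TAALED’s ‘root_ttr_fw’, mentioned earlier, applies the root formula to function words only):

```python
def ttr(tokens):
    # Classical type-token ratio: unique words / total words.
    return len(set(tokens)) / len(tokens)

def root_ttr(tokens):
    # Root TTR: unique words / square root of total words.
    return len(set(tokens)) / len(tokens) ** 0.5

tokens = "the quick brown fox jumps over the lazy dog".split()
print(ttr(tokens))       # 8/9 ~ 0.89
print(root_ttr(tokens))  # 8/3 ~ 2.67
```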

Overall, the 20 most important features explain 11.8% ± 10.5% of the model’s predictive power. Consequently, it is hypothesized that they are responsible for the increase in the percentages of exact and adjacent (±1) matches in comparison to the naïve predictor, improving from 33.1% to 48.2% for exact predictions, and improving from 69.8% to 92.3% for adjacent (±1) matches. Remarkably, there is no dominant feature, with all weights ranging between 0.45% and 1.21%.

Style Rubric

Six features of lexical diversity carry a heavy weight in the rubric scoring model’s predictive power, between 9.8% and 21.7%. Notably, the feature ‘lexical_density_type’ is counted twice since it is included in both TAALED’s and TAACO’s sets of linguistic indices. This is a minor limitation that will be addressed in future work.

In terms of vocabulary, the model considers whether words are indicative of: a) an assessment of moral approval or good fortune (the writing assessment’s prompt requires writing about patience), b) quantities such as cardinal numbers, c) frequency or pattern of recurrence, and d) socially-defined interpersonal processes (e.g., abolish, absentee, absolve, abuse, accept). Between 3.8% and 9.7% of the overall predictive power of the rubric scoring model is explained by these features.

The machine marker examines the number of sentence linking words (e.g., nonetheless, therefore, although), the number of sentences with any lemma overlap with the next two sentences, and the variety of content word lemmas, accounting for 2.3–5.7% of the rubric scoring model’s accuracy.

Information about lexical sophistication is also important to the model and improves its accuracy by 1.5–6.3%. In particular, the model considers word neighbor information such as the average number of phonological neighbors for each word in text (number of words that differ by one phoneme, excluding homophones), the degree of academic language within the essay, and the average strength of association inside any bigrams of words, that is the mean probability that any two consecutive words will occur together (Kyle et al. 2018).

The machine marker considers the number of relative clause modifiers per nominal, variety of dependents per nominal complement, and the number of nominal complements per clause. These measures of noun phrase/clause complexity and variety together explain 1.5–5.4% of the model’s predictions.

In sum, Style Rubric’s 20 most important features explain 33.8% ± 14.9% of the model’s predictive power. Consequently, it is hypothesized that they are responsible for the increase in the percentages of exact and adjacent (±1) predictions in comparison to the naïve predictor, improving from 46.3% to 55.4% for exact matches and from 79.7% to 95.9% for adjacent (±1) matches. Notably, this set of 20 most important features is the most predictive among the four rubrics and most accurately determines the scores of its rubric: at minimum, it represents 18.9% of the predictive power and, at maximum, 48.7%. The Style rubric scoring model has both the highest naïve-predictor performance and the best trained-model performance of all four rubrics.

Conventions Rubric

Conventions Rubric scoring model uses two metrics of lexical diversity to determine rubric scores. Both measure the variety of function words within an essay and carry a weight of 2.6–7.4% of the scoring model’s total accuracy.

Syntactic sophistication and complexity play an important role in the predictive power of the rubric scoring model, accounting for between 5.0% and 11.2%. It pays attention to a) the number of complex nominals per clause; b) the variety in the number of dependents per nominal complement; c) factors related to the preferential lexical company kept by a verb construction (the interface between lexis and grammar), measured in relation to an academic corpus; d) the usage of less frequent verb-construction combinations (based on the lemmas of the constructions); e) the use of possessives in nominal subjects, direct objects, and prepositional objects (e.g., my, his, her, their, etc.); and f) the number of relative clause modifiers per nominal (Kyle 2016).

Between 4.4% and 10.4% of the model’s accuracy depends on information about aspects of lexical sophistication. For instance, the model is interested in the contextual distinctiveness of the vocabulary employed, that is, the likelihood that a word will come to mind in response to a variety of stimuli (Kyle et al. 2018). It also looks at the average probability that any two consecutive words in an essay will co-occur. Further, it considers how frequently words occur in general usage; low frequency is indicative of more sophisticated vocabulary and carries more information, while higher frequency denotes more familiar vocabulary and is less informative. The model also takes into account the average scope in which words are used in the literature (normed as per the SUBTLEXus corpus), the mean number of orthographic neighbors for each word in the essay (the average number of words that can be formed by changing just one letter) (Kyle et al. 2018), and word recognition norms in terms of response latency and accuracy.
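To make the orthographic-neighbor notion concrete, a toy version of the computation is sketched below; the word list `lexicon` is a hypothetical placeholder, whereas TAALES norms such counts against a reference corpus:

```python
def orthographic_neighbors(word, lexicon):
    # Count words in `lexicon` of the same length that differ from `word`
    # by exactly one letter (a toy orthographic-neighbor count).
    return sum(
        len(other) == len(word)
        and sum(a != b for a, b in zip(word, other)) == 1
        for other in lexicon
    )

lexicon = {"cat", "hat", "cot", "cast", "coat"}   # placeholder word list
print(orthographic_neighbors("cat", lexicon))      # -> 2 ("hat", "cot")
```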

Regarding text cohesion, the model considers the variety of nouns and pronouns, lexical diversity in terms of the lemmas used in the essay text, and the average similarity between any pair of adjacent sentences. Their combined impact on the model’s accuracy adds up to between 1.6% and 5.4%.

Surprisingly, Conventions Rubric’s list of 20 most important features includes only one metric directly related to grammar and mechanics, namely the number of all grammatical errors. This confirms the findings of Crossley et al. (2019a) that GAMET’s macrofeatures are more efficient than individual microfeatures. The number of grammatical errors plays a limited role in the determination of the Conventions rubric score, that is, 1.0% ± 0.7%, again confirming the weaker association between grammatical accuracy and human judgment of essay quality reported by Crossley et al. (2019a). This study hypothesizes that the Conventions rubric is more sensitive to complex aspects of grammar than to the simple grammar/spelling rules tracked by GAMET. Alternatively, low-quality essays may not exhibit the minimum level of quality required for effective parsing, preventing further detection of grammatical and spelling errors. The follow-up study will test the hypothesis that the effect of grammatical and spelling accuracy on the Conventions rubric scoring model’s performance is mediated through features measuring aspects of syntactic and lexical sophistication and complexity.

Conventions Rubric’s 20 most important features explain 27.1% ± 12.0% of the model’s predictive power. It is hypothesized that they are responsible for the increase in the percentages of exact and adjacent (±1) matches in comparison to the naïve predictor, improving from 33.6% to 53.1% for exact matches and from 70.3% to 93.1% for adjacent (±1) matches. Notably, the Conventions rubric scoring model shows the largest improvement over its corresponding naïve predictor. This set of 20 most important features is the second most predictive among the four rubrics: at minimum, it represents 15.1% of the predictive power and, at maximum, 39.1%.

This article notes that the writing features could have been grouped more consistently among the rubrics; for example, all SEANCE writing features could have been grouped under a single rubric, either Ideas or Style. This study hypothesizes that the moderate-to-strong levels of agreement among rubrics (Table 7) contribute to this phenomenon.

Conclusion

This study investigated the potential of feature-based deep learning (multi-layer perceptron) in AES to predict rubric scores and explained how rubric scores are derived. The 1567 Grade-7 essays of the Automated Student Assessment Prize contest’s seventh dataset were used to train the rubric scoring models of this study. The Suite of Automatic Linguistic Analysis Tools processed each essay, converting it into a vector of 1592 writing features.

This study was a continuation and an improvement of a previous study, which had certain limitations in training generalizable rubric scoring models. The previous study used the 722 Grade-10 essays of ASAP’s eighth dataset, which had a small sample size, a large score scale, and an imbalanced distribution of holistic and rubric scores in which high-quality essays were underrepresented (Boulanger and Kumar 2019). That study also suffered from a limited feature selection process and from the fact that the rubric scoring models were all trained on the same set of features, which hindered the selection of the best-fitting features per rubric. The research discussed in this article employed a larger essay dataset, applied thorough feature selection customized to each rubric, and tested six different deep learning architectures trained on a sample of 200 combinations of hyperparameter values randomly selected out of a space of 48,000 possible combinations. Training revealed that the best deep learning architectures had only two hidden layers, suggesting that shallower MLP neural networks were more accurate than deeper ones (three or more hidden layers).
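The following sketch outlines this kind of randomized hyperparameter search over MLP architectures with scikit-learn; the search space shown is illustrative and much smaller than the study’s 48,000-combination space, and `X_selected` and `rubric_scores` are hypothetical placeholders:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative search space only; not the study's actual grid of 48,000 combinations.
param_distributions = {
    "hidden_layer_sizes": [(64,), (128,), (256,), (64, 32), (128, 64),
                           (256, 128), (128, 64, 32), (256, 128, 64)],
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32, 64, 128],
    "activation": ["relu", "tanh"],
}

search = RandomizedSearchCV(
    MLPClassifier(max_iter=500, early_stopping=True),
    param_distributions,
    n_iter=200,        # sample 200 random combinations, as in the study
    cv=5,
    random_state=0,
)
# search.fit(X_selected, rubric_scores)   # hypothetical selected features and labels
```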

Based on thorough analyses of the distributions of rubric score predictions and of the resolved and human raters’ rubric scores, this study reveals that the rubric scoring models closely approximate the performance of human raters. This raises the question: if a machine marker’s performance can become equivalent to that of the human raters, is it possible to teach the machine to consistently outperform the humans from whom it learns? What quantity and variety of data are needed to carry out research into this issue? What additional writing features need to be developed? These questions will be considered in a longitudinal follow-up study.

This study adopted a series of good practices to train generalizable rubric scoring models and made these practices completely transparent. The black box of each rubric scoring model was then scrutinized to determine which features contributed to the determination of rubric scores and to what degree. A set of the 20 most important features for each rubric emerged, in which at least 15 features were unique to each rubric and did not significantly contribute to the prediction of the other rubric scores. The study also revealed that rubric score prediction does not directly depend on a few word count-based features (all word-count features were pruned). Moreover, each rubric model selected many intuitive features, with no particularly dominant ones, making it more difficult to trick the AES system.

The results of the study are innovative for the AES research community because 1) they are derived from non-linear models, that is, no linearity assumption is made; 2) they provide explanations behind the “reasoning” of the AES system as to why it assigned the rubric scores it did; and 3) this study sets forth a methodology that promotes transparency and understanding of feature-based deep/shallow neural networks. Mechanisms to introduce AI accountability and build trust between AI and human agents are crucial for the reliable and large-scale deployment of AES systems.

This study has limitations. For example, the feature selection filter methods were applied to the entire original training set (from which labeled validation and testing sets were created). To minimize further overfitting, they should have been applied to the training set alone (not the derived validation and testing sets). However, this is expected to have had a trivial impact on the rubric scoring models’ performance. Nonetheless, it is recommended that feature selection filter methods be applied to the unlabeled original validation and testing sets provided by ASAP. Another limitation relates to how the study estimated feature importance, namely by randomizing the values of a single feature at a time and observing the impact on the rubric scoring model’s accuracy. It remains unclear whether this approach captured the interaction effects between features on rubric score prediction; this should be clarified and considered when explaining the AI’s reasoning. An additional limitation is the lack of testing of the rubric scoring models against gaming behaviors; such tests would assess how well the most important features work together in detecting counterfeit essays. Finally, the rubric scoring models could be retrained using only the most important features to reduce the noise introduced by less important or even detrimental features. This will be part of a follow-up study.

This article envisions an end-to-end AES system that provides student writers with predicted holistic and rubric scores and that clearly identifies the scoring criteria behind each rubric. Such a system could detect suboptimal characteristics in student essays and offer formative feedback, helping students move past learning plateaus in English writing. Future work will target the clustering of student essays into a number of clusters corresponding to the number of rubric score levels, to discover discriminative patterns in student essays and improve formative and remedial feedback. Alternatively, statistical methods such as ANOVA or the non-parametric Kruskal-Wallis test could be used to detect differences in feature distributions per rubric score. Further, an LSTM recurrent neural network with an attention mechanism (Alikaniotis et al. 2016; Dong et al. 2017) could be trained to locate the spots in student essays that influence the AES system’s decision when assigning rubric scores.
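As a sketch of the statistical alternative mentioned above, the Kruskal-Wallis test could be applied per feature as follows (SciPy); the data frame, feature values, and the `ideas_score` column are hypothetical placeholders, while the feature names reuse indices discussed earlier:

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Hypothetical frame: one row per essay, SALAT features plus a rubric score column.
df = pd.DataFrame({
    "mtld_original_cw": np.random.rand(300) * 100,
    "you_gi": np.random.rand(300),
    "ideas_score": np.random.randint(0, 7, size=300),
})

# Kruskal-Wallis test: does a feature's distribution differ across rubric score levels?
for feature in ["mtld_original_cw", "you_gi"]:
    groups = [g[feature].values for _, g in df.groupby("ideas_score")]
    h_stat, p_value = kruskal(*groups)
    print(f"{feature}: H = {h_stat:.2f}, p = {p_value:.4f}")
```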