A systematic review of automated writing evaluation systems

Education and Information Technologies

A Correction to this article was published on 01 August 2022

Abstract

Automated writing evaluation (AWE) systems are developed based on interdisciplinary research and technological advances such as natural language processing, computer science, and latent semantic analysis. Despite a steady increase in research publications in this area, the results of AWE investigations are often mixed, and their validity may be questionable. To yield a deeper understanding of the validity of AWE systems, we conducted a systematic review of the empirical AWE research. Using Scopus, we identified 105 published papers on AWE scoring systems and coded them within an argument-based validation framework. The major findings are: (i) AWE scoring research showed a rising trend but was heterogeneous in terms of language environments, ecological settings, and educational levels; (ii) a disproportionate number of studies were carried out on each validity inference, with the evaluation inference receiving the most research attention and the domain description inference being the most neglected; and (iii) most studies adopted quantitative methods and yielded positive results that backed each inference, while some studies also presented counterevidence. The lack of research on domain description (i.e., the correspondence between AWE systems and real-life writing tasks), combined with the heterogeneous contexts, indicated that construct representation in the AWE scoring field needs extensive investigation. Implications and directions for future research are also discussed.
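The coding step described above can be pictured with a minimal sketch (not the authors' actual coding instrument): each reviewed paper is assigned the validity inferences it addresses, and the tallies show how research attention is distributed across inferences. The inference labels follow the argument-based validation literature (e.g., Chapelle et al., 2008; Kane, 2013); the paper IDs and code assignments below are invented purely for illustration.

    from collections import Counter

    # Inferences commonly distinguished in argument-based validation
    # (e.g., Chapelle et al., 2008; Kane, 2013), used here only as tally categories.
    INFERENCES = [
        "domain description", "evaluation", "generalization",
        "explanation", "extrapolation", "utilization",
    ]

    # Hypothetical coding of reviewed papers: paper ID -> inferences addressed.
    # Real codes would come from the review's coding scheme, not from this sketch.
    codes = {
        "P001": ["evaluation", "generalization"],
        "P002": ["evaluation"],
        "P003": ["extrapolation", "utilization"],
    }

    counts = Counter(inference for assigned in codes.values() for inference in assigned)
    for inference in INFERENCES:
        print(f"{inference:>18}: {counts.get(inference, 0)} studies")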

Data availability

Data sharing is not applicable to this article, as the datasets generated during the current study are proprietary to Scopus. Using the search code discussed in the paper, interested readers who have access to Scopus can replicate the dataset.
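The actual search code is reported in the paper itself; as a purely illustrative sketch of the replication workflow, the snippet below shows the kind of Scopus Advanced Search query one might issue for AWE scoring studies and a simple tally over a CSV export of the results. The query string, the file name scopus_export.csv, and the reliance on the "Year" column are assumptions for illustration only, not the search code used in this review.

    import csv
    from collections import Counter

    # Hypothetical Scopus Advanced Search query (illustration only; see the
    # paper for the actual search code used to build the dataset).
    query = (
        'TITLE-ABS-KEY("automated writing evaluation" '
        'OR "automated essay scoring" OR "automated essay evaluation") '
        'AND DOCTYPE(ar) AND LANGUAGE(english)'
    )
    print("Scopus query:", query)

    # Assume the search results were exported from Scopus as a CSV file;
    # "Year" is a standard column in Scopus CSV exports.
    with open("scopus_export.csv", newline="", encoding="utf-8-sig") as f:
        records = list(csv.DictReader(f))

    print(f"{len(records)} records retrieved")
    print("Publications per year:", Counter(r["Year"] for r in records))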

Notes

  1. The papers included in the review are numbered and listed in a supplementary file which can be found in the Appendix.

References

* Refers to papers that are also included in the dataset.

  • Aryadoust, V. (2013). Building a validity argument for a listening test of academic proficiency. Cambridge Scholars Publishing.

  • *Attali, Y. (2015). Reliability-based feature weighting for automated essay scoring. Applied Psychological Measurement, 39(4), 303–313. https://doi.org/10.1177/0146621614561630

  • Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9–17. https://doi.org/10.1111/j.1745-3992.1998.tb00631.x

  • Bridgeman, B. (2013). Human ratings and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 243–254). Routledge/Taylor & Francis Group.

  • *Bridgeman, B., & Ramineni, C. (2017). Design and evaluation of automated writing evaluation models: Relationships with writing in naturalistic settings. Assessing Writing, 34, 62–71. https://doi.org/10.1016/j.asw.2017.10.001

  • *Burstein, J., Elliot, N., & Molloy, H. (2016). Informing automated writing evaluation using the lens of genre: Two studies. CALICO Journal, 33(1), 117–141. https://doi.org/10.1558/cj.v33i1.26374

  • Burstein, J., Riordan, B., & McCaffrey, D. (2020). Expanding automated writing evaluation. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 329–346). Taylor and Francis Group/CRC Press.

  • Chapelle, C., Enright, M., & Jamieson, J. (2008). Building a validity argument for the test of English as a foreign language. Routledge.

  • *Cohen, Y., Levi, E., & Ben-Simon, A. (2018). Validating human and automated scoring of essays against “True” scores. Applied Measurement in Education, 31(3), 241–250. https://doi.org/10.1080/08957347.2018.1464450

  • Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957

  • Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24. https://doi.org/10.1016/j.asw.2012.10.002

  • Dursun, A., & Li, Z. (2021). A systematic review of argument-based validation studies in the field of Language Testing (2000–2018). In C. Chapelle & E. Voss (Eds.), Validity argument in language testing: Case studies of validation research (Cambridge Applied Linguistics) (pp. 45–70). Cambridge University Press.

  • Ericsson, P. F., & Haswell, R. (Eds.). (2006). Machine scoring of student essays: Truth and consequences. Utah State University Press.

  • Enright, M. K., & Quinlan, T. (2010). Complementing human judgment of essays written by English language learners with e-rater® scoring. Language Testing, 27(3), 317–334. https://doi.org/10.1177/0265532210363144

  • Fan, J., & Yan, X. (2020). Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11, 330. https://doi.org/10.3389/fpsyg.2020.00330

  • *Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111–129. https://doi.org/10.1007/s10972-016-9455-6

  • *Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6), 1–44. Retrieved from http://www.jtla.org

  • Hockly, N. (2018). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044

  • Im, G. H., Shin, D., & Cheng, L. (2019). Critical review of validation models and practices in language testing: Their limitations and future directions for validation research. Language Testing in Asia, 9(1), 14.

  • *James, C. L. (2008). Electronic scoring of essays: Does topic matter? Assessing Writing, 13(2), 80–92. https://doi.org/10.1016/j.asw.2008.05.001

  • Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000

  • Keith, T. Z. (2003). Validity and automated essay scoring systems. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 147–168). Erlbaum.

  • *Klobucar, A., Elliot, N., Deess, P., Rudniy, O., & Joshi, K. (2013). Automated scoring in context: Rapid assessment for placed students. Assessing Writing, 18(1), 62–84. https://doi.org/10.1016/j.asw.2012.10.001

  • Lamprianou, I., Tsagari, D., & Kyriakou, N. (2020). The longitudinal stability of rating characteristics in an EFL examination: Methodological and substantive considerations. Language Testing. https://doi.org/10.1177/0265532220940960

  • Lee, Y. W., Gentile, C., & Kantor, R. (2010). Toward automated multi-trait scoring of essays: Investigating links among holistic, analytic, and text feature scores. Applied Linguistics, 31(3), 391–417. https://doi.org/10.1093/applin/amp040

  • *Li, J., Link, S., & Hegelheimer, V. (2015). Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction. Journal of Second Language Writing, 27, 1–18. https://doi.org/10.1016/j.jslw.2014.10.004

  • Li, S., & Wang, H. (2018). Traditional literature review and research synthesis. In A. Phakiti, P. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave Handbook of applied linguistics research methodology (pp. 123–144). Palgrave-MacMillan.

  • Liu, S., & Kunnan, A. J. (2016). Investigating the application of automated writing evaluation to Chinese undergraduate English majors: A case study of WritetoLearn. CALICO Journal, 33(1), 71–91. https://doi.org/10.1558/cj.v33i1.26380.

  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education and Macmillan.

  • Mislevy, R. (2020). An evidentiary-reasoning perspective on automated scoring: Commentary on part I. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 151–167). Taylor and Francis Group/CRC Press.

  • National Council of Teachers of English. (2013). NCTE position statement on machine scoring. https://ncte.org/statement/machine_scoring/

  • Phakiti, A., De Costa, P., Plonsky, L., & Starfield, S. (2018). Applied linguistics research: Current issues, methods, and trends. In A. Phakiti, P. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave Handbook of applied linguistics research methodology (pp. 5–29). Palgrave-MacMillan.

  • *Perelman, L. (2014). When “the state of the art” is counting words. Assessing Writing, 21, 104–111. https://doi.org/10.1016/j.asw.2014.05.001

  • *Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002a). Stumping e-rater: Challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103–134. https://doi.org/10.1016/s0747-5632(01)00052-8

  • *Powers, D. E., Burstein, J. C., Chodorow, M. S., Fowles, M. E., & Kukich, K. (2002b). Comparing the validity of automated and human scoring of essays. Journal of Educational Computing Research, 26(4), 407–425. https://doi.org/10.1092/UP3H-M3TE-Q290-QJ2T

  • *Qian, L., Zhao, Y., & Cheng, Y. (2020). Evaluating China’s Automated Essay Scoring System iWrite. Journal of Educational Computing Research, 58(4), 771–790. https://doi.org/10.1177/0735633119881472

  • Ramesh, D., & Sanampudi, S. K. (2021). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3), 2495–2527. https://doi.org/10.1007/s10462-021-10068-2

  • Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39. https://doi.org/10.1016/j.asw.2012.10.004

  • *Ramineni, C., & Williamson, D. (2018). Understanding mean score differences between the e-rater® automated scoring engine and humans for demographically based groups in the GRE® General Test. ETS Research Report Series, 2018(1), 1–31. https://doi.org/10.1002/ets2.12192

  • *Reilly, E. D., Stafford, R. E., Williams, K. M., & Corliss, S. B. (2014). Evaluating the validity and applicability of automated essay scoring in two massive open online courses. International Review of Research in Open and Distance Learning, 15(5), 83–98. https://doi.org/10.19173/irrodl.v15i5.1857

  • Reilly, E. D., Williams, K. M., Stafford, R. E., Corliss, S. B., Walkow, J. C., & Kidwell, D. K. (2016). Global times call for global measures: Investigating automated essay scoring in linguistically diverse MOOCs. Online Learning Journal, 20(2). https://doi.org/10.24059/olj.v20i2.638

  • Riazi, M., Shi, L., & Haggerty, J. (2018). Analysis of the empirical research in the journal of second language writing at its 25th year (1992–2016). Journal of Second Language Writing, 41, 41–54. https://doi.org/10.1016/j.jslw.2018.07.002

  • Richardson, M., & Clesham, R. (2021). Rise of the machines? The evolving role of AI technologies in high-stakes assessment. London Review of Education, 19(1), 9, 1–13. https://doi.org/10.14324/LRE.19.1.09

  • Rotou, O., & Rupp, A. A. (2020). Evaluations of automated scoring systems in practice. ETS Research Report Series, 2020(1), 1–18. https://doi.org/10.1002/ets2.12293

  • Sarkis-Onofre, R., Catalá-López, F., Aromataris, E., & Lockwood, C. (2021). How to properly use the PRISMA Statement. Systematic Reviews, 10(1). https://doi.org/10.1186/s13643-021-01671-z

  • Sawaki, Y., & Xi, X. (2019). Univariate generalizability theory in language assessment. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment (Vol. 1, pp. 30–53). Routledge.

  • Schotten, M., Aisati, M., Meester, W. J. N., Steigninga, S., & Ross, C. A. (2018). A brief history of Scopus: The world’s largest abstract and citation database of scientific literature. In F. J. Cantu-Ortiz (Ed.), Research analytics: Boosting university productivity and competitiveness through Scientometrics (pp. 33–57). Taylor & Francis.

  • *Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.

  • Shermis, M. D. (2020). International application of Automated Scoring. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 113–132). Taylor and Francis Group/CRC Press.

  • Shermis, M. D., & Burstein, J. (2003). Introduction. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. xiii–xvi). Lawrence Erlbaum Associates.

  • Shermis, M. D., Burstein, J., & Bursky, S. A. (2013). Introduction to automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 1–15). Routledge/Taylor & Francis Group.

  • Shermis, M., Burstein, J., Elliot, N., Miel, S., & Foltz, P. (2016). Automated writing evaluation: A growing body of knowledge. In C. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (pp. 395–409). Guilford Press.

  • Shin, J., & Gierl, M. J. (2020). More efficient processes for creating automated essay scoring frameworks: A demonstration of two algorithms. Language Testing, 38(2), 247–272. https://doi.org/10.1177/0265532220937830

  • Stevenson, M., & Phakiti, A. (2014). The effects of computer-generated feedback on the quality of writing. Assessing Writing, 19, 51–65. https://doi.org/10.1016/j.asw.2013.11.007

  • Stevenson, M. (2016). A critical interpretative synthesis: The integration of Automated Writing Evaluation into classroom writing instruction. Computers and Composition, 42, 1–16. https://doi.org/10.1016/j.compcom.2016.05.001

  • Stevenson, M., & Phakiti, A. (2019). Automated feedback and second language writing. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing: Contexts and issues (pp. 125–142). Cambridge University Press. https://doi.org/10.1017/9781108635547.009

  • Toulmin, S. E. (2003). The uses of argument (Updated). Cambridge University Press.

  • *Tsai, M. H. (2012). The consistency between human raters and an automated essay scoring system in grading high school students' English writing. Action in Teacher Education, 34(4), 328–335. https://doi.org/10.1080/01626620.2012.717033

  • Vojak, C., Kline, S., Cope, B., McCarthey, S., & Kalantzis, M. (2011). New spaces and old places: An analysis of writing assessment software. Computers and Composition, 28(2), 97–111.

  • *Vajjala, S. (2018). Automated assessment of non-native learner essays: Investigating the role of linguistic features. International Journal of Artificial Intelligence in Education, 28(1), 79–105. https://doi.org/10.1007/s40593-017-0142-3

  • Ware, P. (2011). Computer-generated feedback on student writing. TESOL Quarterly, 45(4), 769–774. https://doi.org/10.5054/tq.2011.272525

  • Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157–180. https://doi.org/10.1191/1362168806lr190oa

  • Weigle, S. C. (2013a). English as a second language writing and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 36–54). Routledge/Taylor & Francis Group.

  • Weigle, S. C. (2013b). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18(1), 85–99. https://doi.org/10.1016/j.asw.2012.10.006

  • *Wilson, J. (2017). Associated effects of automated essay evaluation software on growth in writing quality for students with and without disabilities. Reading and Writing, 30(4), 691–718. https://doi.org/10.1007/s11145-016-9695-z

  • Williamson, D., Xi, X., & Breyer, F. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x

  • Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300. https://doi.org/10.1177/0265532210364643

  • Zheng, Y., & Yu, S. (2019). What has been assessed in writing and how? Empirical evidence from Assessing Writing (2000–2018). Assessing Writing, 42, 100421. https://doi.org/10.1016/j.asw.2019.100421

Author information

Corresponding author

Correspondence to Vahid Aryadoust.

Ethics declarations

Human and animal rights and informed consent

The study does not include any human participants or animals. According to the Research Ethics Committee of Nanyang Technological University, no ethical approval is required where a study involves no human or animal subjects. As a result, no informed consent was necessary.

Conflict of interest

No potential conflict of interest was reported by the author(s).

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOCX 46 KB)

About this article

Cite this article

Huawei, S., Aryadoust, V. A systematic review of automated writing evaluation systems. Educ Inf Technol 28, 771–795 (2023). https://doi.org/10.1007/s10639-022-11200-7
