Abstract
Automated writing evaluation (AWE) systems are developed on the basis of interdisciplinary research and technological advances in fields such as natural language processing, computer science, and latent semantic analysis. Despite a steady increase in research publications in this area, the results of AWE investigations are often mixed, and their validity may be questionable. To gain a deeper understanding of the validity of AWE systems, we conducted a systematic review of the empirical AWE research. Using Scopus, we identified 105 published papers on AWE scoring systems and coded them within an argument-based validation framework. The major findings are: (i) AWE scoring research showed a rising trend but was heterogeneous in terms of language environments, ecological settings, and educational levels; (ii) the validity inferences received disproportionate research attention, with the evaluation inference investigated the most and the domain description inference the least; and (iii) most studies adopted quantitative methods and yielded positive results backing each inference, although some studies also presented counterevidence. The lack of research on domain description (i.e., the correspondence between AWE systems and real-life writing tasks), combined with the heterogeneous contexts, indicates that construct representation in the AWE scoring field needs extensive investigation. Implications and directions for future research are also discussed.
Data availability
Data sharing is not applicable to this article, as the datasets generated during the current study are proprietary to Scopus. Interested readers with access to Scopus can replicate the dataset using the search code described in the paper.
Change history
01 August 2022
A Correction to this paper has been published: https://doi.org/10.1007/s10639-022-11260-9
Notes
The papers included in the review are numbered and listed in a supplementary file which can be found in the Appendix.
References
* Refers to papers that are also included in the dataset.
Aryadoust, V. (2013). Building a validity argument for a listening test of academic proficiency. Cambridge Scholars Publishing.
*Attali, Y. (2015). Reliability-based feature weighting for automated essay scoring. Applied Psychological Measurement, 39(4), 303–313. https://doi.org/10.1177/0146621614561630
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9–17. https://doi.org/10.1111/j.1745-3992.1998.tb00631.x
Bridgeman, B. (2013). Human ratings and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 243–254). Routledge/Taylor & Francis Group.
*Bridgeman, B., & Ramineni, C. (2017). Design and evaluation of automated writing evaluation models: Relationships with writing in naturalistic settings. Assessing Writing, 34, 62–71. https://doi.org/10.1016/j.asw.2017.10.001
*Burstein, J., Elliot, N., & Molloy, H. (2016). Informing automated writing evaluation using the lens of genre: Two studies. CALICO Journal, 33(1), 117–141. https://doi.org/10.1558/cj.v33i1.26374
Burstein, J., Riordan, B., & McCaffrey, D. (2020). Expanding automated writing evaluation. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 329–346). Taylor and Francis Group/CRC Press.
Chapelle, C., Enright, M., & Jamieson, J. (2008). Building a validity argument for the test of English as a foreign language. Routledge.
*Cohen, Y., Levi, E., & Ben-Simon, A. (2018). Validating human and automated scoring of essays against “True” scores. Applied Measurement in Education, 31(3), 241–250. https://doi.org/10.1080/08957347.2018.1464450
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24. https://doi.org/10.1016/j.asw.2012.10.002
Dursun, A., & Li, Z. (2021). A systematic review of argument-based validation studies in the field of Language Testing (2000–2018). In C. Chapelle & E. Voss (Eds.), Validity argument in language testing: Case studies of validation research (Cambridge Applied Linguistics) (pp. 45–70). Cambridge University Press.
Enright, M. K., & Quinlan, T. (2010). Complementing human judgment of essays written by English language learners with e-rater® scoring. Language Testing, 27(3), 317–334. https://doi.org/10.1177/0265532210363144
Ericsson, P. F., & Haswell, R. (Eds.). (2006). Machine scoring of student essays: Truth and consequences. Utah State University Press.
Fan, J., & Yan, X. (2020). Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11, 330. https://doi.org/10.3389/fpsyg.2020.00330
*Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111–129. https://doi.org/10.1007/s10972-016-9455-6
*Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6), 1–44. Retrieved from http://www.jtla.org
Hockly, N. (2018). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044
Im, G. H., Shin, D., & Cheng, L. (2019). Critical review of validation models and practices in language testing: Their limitations and future directions for validation research. Language Testing in Asia, 9(1), 14.
*James, C. L. (2008). Electronic scoring of essays: Does topic matter? Assessing Writing, 13(2), 80-92. https://doi.org/10.1016/j.asw.2008.05.001
Kane, M. (2013). Validating the Interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Keith, T. Z. (2003). Validity and automated essay scoring systems. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 147–168). Erlbaum.
*Klobucar, A., Elliot, N., Deess, P., Rudniy, O., & Joshi, K. (2013). Automated scoring in context: Rapid assessment for placed students. Assessing Writing, 18(1), 62–84. https://doi.org/10.1016/j.asw.2012.10.001
Lamprianou, I., Tsagari, D., & Kyriakou, N. (2020). The longitudinal stability of rating characteristics in an EFL examination: Methodological and substantive considerations. Language Testing. https://doi.org/10.1177/0265532220940960
Lee, Y. W., Gentile, C., & Kantor, R. (2010). Toward automated multi-trait scoring of essays: Investigating links among holistic, analytic, and text feature scores. Applied Linguistics, 31(3), 391–417. https://doi.org/10.1093/applin/amp040
*Li, J., Link, S., & Hegelheimer, V. (2015). Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction. Journal of Second Language Writing, 27, 1–18. https://doi.org/10.1016/j.jslw.2014.10.004
Li, S., & Wang, H. (2018). Traditional literature review and research synthesis. In A. Phakiti, P. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave Handbook of applied linguistics research methodology (pp. 123–144). Palgrave-MacMillan.
Liu, S., & Kunnan, A. J. (2016). Investigating the application of automated writing evaluation to Chinese undergraduate English majors: A case study of WritetoLearn. CALICO Journal, 33(1), 71–91. https://doi.org/10.1558/cj.v33i1.26380
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education and Macmillan.
Mislevy, R. (2020). An evidentiary-reasoning perspective on automated scoring: Commentary on part I. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 151–167). Taylor and Francis Group/CRC Press.
National Council of Teachers of English. (2013). NCTE position statement on machine scoring. https://ncte.org/statement/machine_scoring/
*Perelman, L. (2014). When "the state of the art" is counting words. Assessing Writing, 21, 104–111. https://doi.org/10.1016/j.asw.2014.05.001
Phakiti, A., De Costa, P., Plonsky, L., & Starfield, S. (2018). Applied linguistics research: Current issues, methods, and trends. In A. Phakiti, P. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave Handbook of applied linguistics research methodology (pp. 5–29). Palgrave-MacMillan.
*Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002a). Stumping e-rater: Challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103–134. https://doi.org/10.1016/s0747-5632(01)00052-8
*Powers, D. E., Burstein, J. C., Chodorow, M. S., Fowles, M. E., & Kukich, K. (2002b). Comparing the validity of automated and human scoring of essays. Journal of Educational Computing Research, 26(4), 407–425. https://doi.org/10.1092/UP3H-M3TE-Q290-QJ2T
*Qian, L., Zhao, Y., & Cheng, Y. (2020). Evaluating China's automated essay scoring system iWrite. Journal of Educational Computing Research, 58(4), 771–790. https://doi.org/10.1177/0735633119881472
Ramesh, D., & Sanampudi, S. K. (2021). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3), 2495–2527. https://doi.org/10.1007/s10462-021-10068-2
Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39. https://doi.org/10.1016/j.asw.2012.10.004
*Ramineni, C., & Williamson, D. (2018). Understanding mean score differences between the e-rater® automated scoring engine and humans for demographically based groups in the GRE® General Test. ETS Research Report Series, 2018(1), 1–31. https://doi.org/10.1002/ets2.12192
*Reilly, E. D., Stafford, R. E., Williams, K. M., & Corliss, S. B. (2014). Evaluating the validity and applicability of automated essay scoring in two massive open online courses. International Review of Research in Open and Distance Learning, 15(5), 83–98. https://doi.org/10.19173/irrodl.v15i5.1857
Reilly, E. D., Williams, K. M., Stafford, R. E., Corliss, S. B., Walkow, J. C., & Kidwell, D. K. (2016). Global times call for global measures: Investigating automated essay scoring in linguistically diverse MOOCs. Online Learning Journal, 20(2). https://doi.org/10.24059/olj.v20i2.638
Riazi, M., Shi, L., & Haggerty, J. (2018). Analysis of the empirical research in the journal of second language writing at its 25th year (1992–2016). Journal of Second Language Writing, 41, 41–54. https://doi.org/10.1016/j.jslw.2018.07.002
Richardson, M., & Clesham, R. (2021). Rise of the machines? The evolving role of AI technologies in high-stakes assessment. London Review of Education, 19(1), 1–13. https://doi.org/10.14324/LRE.19.1.09
Rotou, O., & Rupp, A. A. (2020). Evaluations of automated scoring systems in practice. ETS Research Report Series, 2020(1), 1–18. https://doi.org/10.1002/ets2.12293
Sarkis-Onofre, R., Catalá-López, F., Aromataris, E., & Lockwood, C. (2021). How to properly use the PRISMA Statement. Systematic Reviews, 10(1). https://doi.org/10.1186/s13643-021-01671-z
Sawaki, Y., & Xi, X. (2019). Univariate generalizability theory in language assessment. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment (Vol. 1, pp. 30–53). Routledge.
Schotten, M., Aisati, M., Meester, W. J. N., Steigninga, S., & Ross, C. A. (2018). A brief history of Scopus: The world’s largest abstract and citation database of scientific literature. In F. J. Cantu-Ortiz (Ed.), Research analytics: Boosting university productivity and competitiveness through Scientometrics (pp. 33–57). Taylor & Francis.
*Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.
Shermis, M. D. (2020). International application of Automated Scoring. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 113–132). Taylor and Francis Group/CRC Press.
Shermis, M. D., & Burstein, J. (2003). Introduction. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. xiii–xvi). Lawrence Erlbaum Associates.
Shermis, M. D., Burstein, J., & Bursky, S. A. (2013). Introduction to automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 1–15). Routledge/Taylor & Francis Group.
Shermis, M., Burstein, J., Elliot, N., Miel, S., & Foltz, P. (2016). Automated writing evaluation: A growing body of knowledge. In C. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (pp. 395–409). Guilford Press.
Shin, J., & Gierl, M. J. (2020). More efficient processes for creating automated essay scoring frameworks: A demonstration of two algorithms. Language Testing, 38(2), 247–272. https://doi.org/10.1177/0265532220937830
Stevenson, M., & Phakiti, A. (2014). The effects of computer-generated feedback on the quality of writing. Assessing Writing, 19, 51–65. https://doi.org/10.1016/j.asw.2013.11.007
Stevenson, M. (2016). A critical interpretative synthesis: The integration of automated writing evaluation into classroom writing instruction. Computers and Composition, 42, 1–16. https://doi.org/10.1016/j.compcom.2016.05.001
Stevenson, M., & Phakiti, A. (2019). Automated feedback and second language writing. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing: Contexts and issues (pp. 125–142). Cambridge University Press. https://doi.org/10.1017/9781108635547.009
Toulmin, S. E. (2003). The uses of argument (Updated). Cambridge University Press.
*Tsai, M. H. (2012). The consistency between human raters and an automated essay scoring system in grading high school students' English writing. Action in Teacher Education, 34(4), 328–335. https://doi.org/10.1080/01626620.2012.717033
*Vajjala, S. (2018). Automated assessment of non-native learner essays: Investigating the role of linguistic features. International Journal of Artificial Intelligence in Education, 28(1), 79–105. https://doi.org/10.1007/s40593-017-0142-3
Vojak, C., Kline, S., Cope, B., McCarthey, S., & Kalantzis, M. (2011). New spaces and old places: An analysis of writing assessment software. Computers and Composition, 28(2), 97–111.
Ware, P. (2011). Computer-generated feedback on student writing. TESOL Quarterly, 45(4), 769–774. https://doi.org/10.5054/tq.2011.272525
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157–180. https://doi.org/10.1191/1362168806lr190oa
Weigle, S. C. (2013a). English as a second language writing and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 36–54). Routledge/Taylor & Francis Group.
Weigle, S. C. (2013b). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18(1), 85–99. https://doi.org/10.1016/j.asw.2012.10.006
*Wilson, J. (2017). Associated effects of automated essay evaluation software on growth in writing quality for students with and without disabilities. Reading and Writing, 30(4), 691–718. https://doi.org/10.1007/s11145-016-9695-z
Williamson, D., Xi, X., & Breyer, F. (2012). A Framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300. https://doi.org/10.1177/0265532210364643
Zheng, Y., & Yu, S. (2019). What has been assessed in writing and how? Empirical evidence from Assessing Writing (2000–2018). Assessing Writing, 42, 100421. https://doi.org/10.1016/j.asw.2019.100421
Ethics declarations
Human and animal rights and informed consent
The study does not include any human participants or animals. According to the Research Ethics Committee of Nanyang Technological University, ethical approval is not required for studies without human or animal subjects. Accordingly, no informed consent was necessary for this study.
Conflict of interest
No potential conflict of interest was reported by the author(s).
Cite this article
Huawei, S., Aryadoust, V. A systematic review of automated writing evaluation systems. Educ Inf Technol 28, 771–795 (2023). https://doi.org/10.1007/s10639-022-11200-7