Abstract
Automated writing evaluation (AWE) systems are developed on the basis of interdisciplinary research and technological advances in fields such as natural language processing, computer science, and latent semantic analysis. Despite a steady increase in research publications in this area, the results of AWE investigations are often mixed, and their validity may be questionable. To gain a deeper understanding of the validity of AWE systems, we conducted a systematic review of the empirical AWE research. Using Scopus, we identified 105 published papers on AWE scoring systems and coded them within an argument-based validation framework. The major findings are: (i) AWE scoring research showed a rising trend but was heterogeneous in terms of language environments, ecological settings, and educational levels; (ii) the validity inferences received disproportionate research attention, with the evaluation inference investigated the most and the domain description inference the least; and (iii) most studies adopted quantitative methods and yielded positive results backing each inference, although some studies also presented counterevidence. The lack of research on domain description (i.e., the correspondence between AWE systems and real-life writing tasks), combined with the heterogeneous contexts, indicates that construct representation in the AWE scoring field needs extensive investigation. Implications and directions for future research are also discussed.
Data availability
Data sharing is not applicable to this article, as the datasets generated during the current study are proprietary to Scopus. Interested readers with access to Scopus can replicate the dataset using the search code described in the paper.
Change history
01 August 2022
A Correction to this paper has been published: https://doi.org/10.1007/s10639-022-11260-9
Notes
The papers included in the review are numbered and listed in a supplementary file which can be found in the Appendix.
References
* Refers to papers that are also included in the dataset.
Aryadoust, V. (2013). Building a validity argument for a listening test of academic proficiency. Cambridge Scholars Publishing.
*Attali, Y. (2015). Reliability-based feature weighting for automated essay scoring. Applied Psychological Measurement, 39(4), 303–313. https://doi.org/10.1177/0146621614561630
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It’s not only the scoring. Educational Measurement: Issues and Practice, 17(4), 9–17. https://doi.org/10.1111/j.1745-3992.1998.tb00631.x
Bridgeman, B. (2013). Human ratings and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 243–254). Routledge/Taylor & Francis Group.
*Bridgeman, B., & Ramineni, C. (2017). Design and evaluation of automated writing evaluation models: Relationships with writing in naturalistic settings. Assessing Writing, 34, 62–71. https://doi.org/10.1016/j.asw.2017.10.001
*Burstein, J., Elliot, N., & Molloy, H. (2016). Informing automated writing evaluation using the lens of genre: Two studies. CALICO Journal, 33(1), 117–141. https://doi.org/10.1558/cj.v33i1.26374
Burstein, J., Riordan, B., & McCaffrey, D. (2020). Expanding automated writing evaluation. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 329–346). Taylor and Francis Group/CRC Press.
Chapelle, C., Enright, M., & Jamieson, J. (2008). Building a validity argument for the test of English as a foreign language. Routledge.
*Cohen, Y., Levi, E., & Ben-Simon, A. (2018). Validating human and automated scoring of essays against “True” scores. Applied Measurement in Education, 31(3), 241–250. https://doi.org/10.1080/08957347.2018.1464450
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24. https://doi.org/10.1016/j.asw.2012.10.002
Dursun, A., & Li, Z. (2021). A systematic review of argument-based validation studies in the field of Language Testing (2000–2018). In C. Chapelle & E. Voss (Eds.), Validity argument in language testing: Case studies of validation research (Cambridge Applied Linguistics) (pp. 45–70). Cambridge University Press.
Enright, M. K., & Quinlan, T. (2010). Complementing human judgment of essays written by English language learners with e-rater® scoring. Language Testing, 27(3), 317–334. https://doi.org/10.1177/0265532210363144
Ericsson, P. F., & Haswell, R. (Eds.). (2006). Machine scoring of student essays: Truth and consequences. Utah State University Press.
Fan, J., & Yan, X. (2020). Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11, 330. https://doi.org/10.3389/fpsyg.2020.00330
*Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111–129. https://doi.org/10.1007/s10972-016-9455-6
*Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6), 1–44. Retrieved from http://www.jtla.org
Hockly, N. (2018). Automated writing evaluation. ELT Journal, 73(1), 82–88. https://doi.org/10.1093/elt/ccy044
Im, G. H., Shin, D., & Cheng, L. (2019). Critical review of validation models and practices in language testing: Their limitations and future directions for validation research. Language Testing in Asia, 9(1), 14.
*James, C. L. (2008). Electronic scoring of essays: Does topic matter? Assessing Writing, 13(2), 80-92. https://doi.org/10.1016/j.asw.2008.05.001
Kane, M. (2013). Validating the Interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Keith, T. Z. (2003). Validity and automated essay scoring systems. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 147–168). Erlbaum.
*Klobucar, A., Elliot, N., Deess, P., Rudniy, O., & Joshi, K. (2013). Automated scoring in context: Rapid assessment for placed students. Assessing Writing, 18(1), 62–84. https://doi.org/10.1016/j.asw.2012.10.001
Lamprianou, I., Tsagari, D., & Kyriakou, N. (2020). The longitudinal stability of rating characteristics in an EFL examination: Methodological and substantive considerations. Language Testing. https://doi.org/10.1177/0265532220940960
Lee, Y. W., Gentile, C., & Kantor, R. (2010). Toward automated multi-trait scoring of essays: Investigating links among holistic, analytic, and text feature scores. Applied Linguistics, 31(3), 391–417. https://doi.org/10.1093/applin/amp040
*Li, J., Link, S., & Hegelheimer, V. (2015). Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction. Journal of Second Language Writing, 27, 1–18. https://doi.org/10.1016/j.jslw.2014.10.004
Li, S., & Wang, H. (2018). Traditional literature review and research synthesis. In A. Phakiti, P. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave Handbook of applied linguistics research methodology (pp. 123–144). Palgrave-MacMillan.
Liu, S., & Kunnan, A. J. (2016). Investigating the application of automated writing evaluation to Chinese undergraduate English majors: A case study of WritetoLearn. CALICO Journal, 33(1), 71–91. https://doi.org/10.1558/cj.v33i1.26380
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). American Council on Education and Macmillan.
Mislevy, R. (2020). An evidentiary-reasoning perspective on automated scoring: Commentary on part I. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 151–167). Taylor and Francis Group/CRC Press.
National Council of Teachers of English. (2013). NCTE position statement on machine scoring. https://ncte.org/statement/machine_scoring/
*Perelman, L. (2014). When "the state of the art" is counting words. Assessing Writing, 21, 104–111. https://doi.org/10.1016/j.asw.2014.05.001
Phakiti, A., De Costa, P., Plonsky, L., & Starfield, S. (2018). Applied linguistics research: Current issues, methods, and trends. In A. Phakiti, P. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave Handbook of applied linguistics research methodology (pp. 5–29). Palgrave-MacMillan.
*Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002a). Stumping e-rater: Challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103–134. https://doi.org/10.1016/s0747-5632(01)00052-8
*Powers, D. E., Burstein, J. C., Chodorow, M. S., Fowles, M. E., & Kukich, K. (2002b). Comparing the validity of automated and human scoring of essays. Journal of Educational Computing Research, 26(4), 407–425. https://doi.org/10.1092/UP3H-M3TE-Q290-QJ2T
*Qian, L., Zhao, Y., & Cheng, Y. (2020). Evaluating China's automated essay scoring system iWrite. Journal of Educational Computing Research, 58(4), 771–790. https://doi.org/10.1177/0735633119881472
Ramesh, D., & Sanampudi, S. K. (2021). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3), 2495–2527. https://doi.org/10.1007/s10462-021-10068-2
Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39. https://doi.org/10.1016/j.asw.2012.10.004
*Ramineni, C., & Williamson, D. (2018). Understanding mean score differences between the e-rater® automated scoring engine and humans for demographically based groups in the GRE® General Test. ETS Research Report Series, 2018(1), 1–31. https://doi.org/10.1002/ets2.12192
*Reilly, E. D., Stafford, R. E., Williams, K. M., & Corliss, S. B. (2014). Evaluating the validity and applicability of automated essay scoring in two massive open online courses. International Review of Research in Open and Distance Learning, 15(5), 83–98. https://doi.org/10.19173/irrodl.v15i5.1857
Reilly, E. D., Williams, K. M., Stafford, R. E., Corliss, S. B., Walkow, J. C., & Kidwell, D. K. (2016). Global times call for global measures: Investigating automated essay scoring in linguistically diverse MOOCs. Online Learning Journal, 20(2). https://doi.org/10.24059/olj.v20i2.638
Riazi, M., Shi, L., & Haggerty, J. (2018). Analysis of the empirical research in the journal of second language writing at its 25th year (1992–2016). Journal of Second Language Writing, 41, 41–54. https://doi.org/10.1016/j.jslw.2018.07.002
Richardson, M., & Clesham, R. (2021). Rise of the machines? The evolving role of AI technologies in high-stakes assessment. London Review of Education, 19(1), 1–13. https://doi.org/10.14324/LRE.19.1.09
Rotou, O., & Rupp, A. A. (2020). Evaluations of automated scoring systems in practice. ETS Research Report Series, 2020(1), 1–18. https://doi.org/10.1002/ets2.12293
Sarkis-Onofre, R., Catalá-López, F., Aromataris, E., & Lockwood, C. (2021). How to properly use the PRISMA Statement. Systematic Reviews, 10(1). https://doi.org/10.1186/s13643-021-01671-z
Sawaki, Y., & Xi, X. (2019). Univariate generalizability theory in language assessment. In V. Aryadoust & M. Raquel (Eds.), Quantitative data analysis for language assessment (Vol. 1, pp. 30–53). Routledge.
Schotten, M., Aisati, M., Meester, W. J. N., Steigninga, S., & Ross, C. A. (2018). A brief history of Scopus: The world’s largest abstract and citation database of scientific literature. In F. J. Cantu-Ortiz (Ed.), Research analytics: Boosting university productivity and competitiveness through Scientometrics (pp. 33–57). Taylor & Francis.
*Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.
Shermis, M. D. (2020). International application of Automated Scoring. In D. Yan, A. A. Rupp, & P. Foltz (Eds.), Handbook of automated scoring: Theory into practice (pp. 113–132). Taylor and Francis Group/CRC Press.
Shermis, M. D., & Burstein, J. (2003). Introduction. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. xiii–xvi). Lawrence Erlbaum Associates.
Shermis, M. D., Burstein, J., & Bursky, S. A. (2013). Introduction to automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 1–15). Routledge/Taylor & Francis Group.
Shermis, M., Burstein, J., Elliot, N., Miel, S., & Foltz, P. (2016). Automated writing evaluation: A growing body of knowledge. In C. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of writing research (pp. 395–409). Guilford Press.
Shin, J., & Gierl, M. J. (2020). More efficient processes for creating automated essay scoring frameworks: A demonstration of two algorithms. Language Testing, 38(2), 247–272. https://doi.org/10.1177/0265532220937830
Stevenson, M., & Phakiti, A. (2014). The effects of computer-generated feedback on the quality of writing. Assessing Writing, 19, 51–65. https://doi.org/10.1016/j.asw.2013.11.007
Stevenson, M. (2016). A critical interpretative synthesis: The integration of automated writing evaluation into classroom writing instruction. Computers and Composition, 42, 1–16. https://doi.org/10.1016/j.compcom.2016.05.001
Stevenson, M., & Phakiti, A. (2019). Automated feedback and second language writing. In K. Hyland & F. Hyland (Eds.), Feedback in second language writing: Contexts and issues (pp. 125–142). Cambridge University Press. https://doi.org/10.1017/9781108635547.009
Toulmin, S. E. (2003). The uses of argument (Updated). Cambridge University Press.
*Tsai, M. H. (2012). The consistency between human raters and an automated essay scoring system in grading high school students' English writing. Action in Teacher Education, 34(4), 328–335. https://doi.org/10.1080/01626620.2012.717033
*Vajjala, S. (2018). Automated assessment of non-native learner essays: Investigating the role of linguistic features. International Journal of Artificial Intelligence in Education, 28(1), 79–105. https://doi.org/10.1007/s40593-017-0142-3
Vojak, C., Kline, S., Cope, B., McCarthey, S., & Kalantzis, M. (2011). New spaces and old places: An analysis of writing assessment software. Computers and Composition, 28(2), 97–111.
Ware, P. (2011). Computer-generated feedback on student writing. TESOL Quarterly, 45(4), 769–774. https://doi.org/10.5054/tq.2011.272525
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157–180. https://doi.org/10.1191/1362168806lr190oa
Weigle, S. C. (2013a). English as a second language writing and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions (pp. 36–54). Routledge/Taylor & Francis Group.
Weigle, S. C. (2013b). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18(1), 85–99. https://doi.org/10.1016/j.asw.2012.10.006
*Wilson, J. (2017). Associated effects of automated essay evaluation software on growth in writing quality for students with and without disabilities. Reading and Writing, 30(4), 691–718. https://doi.org/10.1007/s11145-016-9695-z
Williamson, D., Xi, X., & Breyer, F. (2012). A Framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
Xi, X. (2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300. https://doi.org/10.1177/0265532210364643
Zheng, Y., & Yu, S. (2019). What has been assessed in writing and how? Empirical evidence from Assessing Writing (2000–2018). Assessing Writing, 42, 100421. https://doi.org/10.1016/j.asw.2019.100421
Ethics declarations
Human and animal rights and informed consent
The study does not include any human participants or animals. According to the Research Ethics Committee of Nanyang Technological University, ethical approval is not required for studies without human or animal subjects. Accordingly, no informed consent was necessary for this study.
Conflict of interest
No potential conflict of interest was reported by the author(s).
Cite this article
Huawei, S., Aryadoust, V. A systematic review of automated writing evaluation systems. Educ Inf Technol 28, 771–795 (2023). https://doi.org/10.1007/s10639-022-11200-7