Abstract
We introduce a short-answer scoring engine composed of an ensemble of deep neural networks and a Latent Semantic Analysis-based model, used to score short constructed responses for a large suite of questions from a national assessment program. We evaluate the performance of the engine and show that it achieves above-human-level performance on a large set of items. Items are scored using 2-point and 3-point holistic rubrics. We outline the items, data, handscoring methods, engine, and results. We also provide an overview of performance for key student groups, including gender, ethnicity, English language proficiency, disability status, and economically disadvantaged status.

Data Availability
The data provided may contain personally identifiable information and is not publicly available.
Code Availability
The code used to model these responses is not generally available.
Notes
A limitation of the student- and item-specific embeddings was that they were trained predominantly on a corpus containing many spelling mistakes. Only approximately 90 thousand words in the 1.12 million-word vocabulary could be found in a large dictionary of correctly spelled words. Ad hoc inspection indicated that many incorrectly spelled words have a very high cosine similarity with their correctly spelled variants.
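The observation above can be illustrated with a minimal sketch: cosine similarity measures the angle between two embedding vectors, so a misspelled word whose vector lies close to that of its correct variant will score near 1.0. The vectors below are invented for illustration and are not drawn from the engine's actual embedding space.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings (real embeddings are much higher-dimensional).
correct = np.array([0.80, 0.10, 0.30])      # e.g., "because"
misspelled = np.array([0.79, 0.12, 0.31])   # e.g., "becuase", nearly collinear
unrelated = np.array([-0.20, 0.90, -0.40])  # an unrelated word

print(cosine_similarity(correct, misspelled))  # very close to 1.0
print(cosine_similarity(correct, unrelated))   # far from 1.0
```

Under this behavior, an engine trained on noisy student writing can treat common misspellings as near-synonyms of their correct forms without an explicit spelling-correction step.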
Ethics declarations
Conflicts of Interest/Competing Interests
Not Applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Ormerod, C., Lottridge, S., Harris, A.E. et al. Automated Short Answer Scoring Using an Ensemble of Neural Networks and Latent Semantic Analysis Classifiers. Int J Artif Intell Educ 33, 467–496 (2023). https://doi.org/10.1007/s40593-022-00294-2