
AI-assisted automated scoring of picture-cued writing tasks for language assessment

Published in Education and Information Technologies

Abstract

Grading assignments is inherently subjective and time-consuming; automatic scoring tools can greatly reduce teacher workload and shorten the time needed to provide feedback to learners. The purpose of this paper is to propose a novel method for automatically scoring student responses to picture-cued writing tasks. As a popular paradigm for language instruction and assessment, a picture-cued writing task typically requires students to describe one or more pictures. Correspondingly, an automatic scoring method must measure the link between the pictures and their textual descriptions. For this purpose, we first designed a picture-cued writing test and collected nearly 4,000 responses from 279 K-12 students. Based on these responses, we then developed an AI scoring model that incorporates emerging cross-modal matching technology and several NLP algorithms. The performance of the model was evaluated carefully with six popular measures; it demonstrated accurate scoring, with a small mean absolute error of 0.479 and a high adjacent-agreement rate of 90.64%. We believe this method could reduce the subjective elements inherent in human grading and free teachers' time from the mundane task of grading for more valuable endeavors, such as designing teaching plans based on AI-generated diagnoses of student progress.




References

  • Aschawir, A. (2014). Using series pictures to develop the students’ ideas in English narrative writing. Scholarly Journal of Education, 3(7), 88–95.

  • Asrifan, A. (2015). The use of pictures story in improving students' ability to write narrative composition. International Journal of Language and Linguistics, 3(4), 244–251. https://doi.org/10.11648/j.ijll.20150304.18

  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V. 2. The Journal of Technology, Learning and Assessment, 4(3), 1–31.

  • Baird, C., & Dooey, P. (2017). Using images to facilitate writing for skills assessment: A visual PELA. The Australian Journal of Indigenous Education, 46(2), 160–172. https://doi.org/10.1017/jie.2016.32

  • Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27–40. https://doi.org/10.1080/08957347.2012.635502

  • Chapelle, C. A., Cotos, E., & Lee, J. (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385–405. https://doi.org/10.1177/0265532214565386

  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.

  • Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., & Han, J. (2020). IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12655–12663). https://doi.org/10.1109/CVPR42600.2020.01267

  • Chen, F., Zhang, D., Han, M., Chen, X., Shi, J., Xu, S., & Xu, B. (2022). VLP: A survey on vision-language pre-training. arXiv preprint arXiv:2202.09061. https://doi.org/10.48550/arXiv.2202.09061

  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).

  • Deeva, G., Bogdanova, D., Serral, E., Snoeck, M., & De Weerdt, J. (2021). A review of automated feedback systems for learners: Classification framework, challenges and opportunities. Computers & Education, 162, 104094. https://doi.org/10.1016/j.compedu.2020.104094

  • Diao, H., Zhang, Y., Ma, L., & Lu, H. (2021). Similarity reasoning and filtration for image-text matching. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (Vol. 35, No. 2, pp. 1218–1226).

  • Elliott, S., Shermis, M. D., & Burstein, J. (2003). Overview of IntelliMetric. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 67–70). Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410606860

  • Erfanian Mohammadi, J., Elahi Shirvan, M., & Akbari, O. (2019). Systemic functional multimodal discourse analysis of teaching students developing classroom materials. Teaching in Higher Education, 24(8), 964–986. https://doi.org/10.1080/13562517.2018.1527763

  • Gabeur, V., Sun, C., Alahari, K., & Schmid, C. (2020). Multi-modal transformer for video retrieval. In Proceedings of European Conference on Computer Vision (pp. 214–229).

  • Haman, E., Łuniewska, M., & Pomiechowska, B. (2015). Designing cross-linguistic lexical tasks (CLTs) for bilingual preschool children. In S. Armon-Lotem, J. de Jong, & N. Meir (Eds.), Methods for assessing multilingual children: Disentangling bilingualism from Language impairment (pp. 194–238). Multilingual Matters. https://doi.org/10.21832/9781783093137-010

  • Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.

  • Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6), 1–36. https://doi.org/10.1145/3295748

  • James, K. H., Vinci-Booher, S., & Munoz-Rubke, F. (2017). The impact of multimodal-multisensory learning on human performance and brain activation patterns. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, & A. Kruger (Eds.), The handbook of multimodal-multisensor interfaces, vol. 1: Foundations, user modeling, and common modality combinations (pp. 51–94). Morgan & Claypool Publishers. https://doi.org/10.1145/3015783.3015787

  • Jin, C., Zhang, T., Liu, S., Tie, Y., Lv, X., Li, J., & Yang, Z. (2021). Cross-modal deep learning applications: Audio-visual retrieval. In Proceedings of International Conference on Pattern Recognition (pp. 301–313).

  • Kharkhurin, A. V. (2012). A preliminary version of an internet-based picture naming test. Open Journal of Modern Linguistics, 2(01), 34–41. https://doi.org/10.4236/ojml.2012.21005

  • Khoii, R., & Doroudian, A. (2014). Automated scoring of EFL learners' written performance: A torture or a blessing. In Proceedings of the Conference on ICT for Language Learning (pp. 5146–5155).

  • Kingston, N., & Nash, B. (2011). Formative assessment: A meta-analysis and a call for research. Educational Measurement: Issues and Practice, 30(4), 28–37. https://doi.org/10.1111/j.1745-3992.2011.00220.x

  • Lee, K. H., Chen, X., Hua, G., Hu, H., & He, X. (2018). Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (pp. 201–216). https://doi.org/10.1007/978-3-030-01225-0_13

  • Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., & Hoi, S. C. H. (2021). Align before fuse: Vision and language representation learning with momentum distillation. In Proceedings of the 35th Conference on Neural Information Processing Systems (pp. 1978–1992).

  • Link, S., Mehrzad, M., & Rahimi, M. (2020). Impact of automated writing evaluation on teacher feedback, student revision, and writing improvement. Computer Assisted Language Learning, 35(4), 605–634. https://doi.org/10.1080/09588221.2020.1743323

  • Listyani, L. (2019). The use of a visual image to promote narrative writing ability and creativity. Eurasian Journal of Educational Research, 80, 193–224.

  • Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., & Zhang, Y. (2020). Graph structured network for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10921–10930). https://doi.org/10.1109/CVPR42600.2020.01093

  • Liu, J., Xu, Y., & Zhu, Y. (2019). Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744. https://doi.org/10.48550/arXiv.1901.07744

  • Lu, C., & Cutumisu, M. (2021). Integrating deep learning into an automated feedback generation system for automated essay scoring. In Proceedings of the 14th International Conference on Educational Data Mining (pp. 573–579).

  • Malali, N., & Keller, Y. (2021). Learning to embed semantic similarity for joint image-text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2021.3132163

  • Mangaroska, K., Martinez-Maldonado, R., Vesin, B., & Gašević, D. (2021). Challenges and opportunities of multimodal data in human learning: The computer science students’ perspective. Journal of Computer Assisted Learning, 37(4), 1030–1047. https://doi.org/10.1111/jcal.12542

  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55–60). https://doi.org/10.3115/v1/P14-5010

  • McCarthy, K. S., Roscoe, R. D., Allen, L. K., Likens, A. D., & McNamara, D. S. (2022). Automated writing evaluation: Does spelling and grammar feedback support high-quality writing and revision? Assessing Writing, 52, 100608. https://doi.org/10.1016/j.asw.2022.100608

  • Page, E. B. (2003). Project essay grade: PEG. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43–54). Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410606860-12

  • Paivio, A. (1991). Dual coding theory: Retrospect and current status. Canadian Journal of Psychology/revue Canadienne De Psychologie, 45(3), 255. https://doi.org/10.1037/h0084295

  • Palermo, C., & Thomson, M. M. (2018). Teacher implementation of self-regulated strategy development with an automated writing evaluation system: Effects on the argumentative writing performance of middle school students. Contemporary Educational Psychology, 54, 255–270. https://doi.org/10.1016/j.cedpsych.2018.07.002

  • Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39. https://doi.org/10.1016/j.asw.2012.10.004

  • Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031

  • Roscoe, R. D., Allen, L. K., Johnson, A. C., & McNamara, D. S. (2018). Automated writing instruction and feedback: Instructional mode, attitudes, and revising. In Proceedings of the 62nd Annual Meeting of the Human Factors and Ergonomics Society (pp. 2089–2093). Human Factors & Ergonomics Society. https://doi.org/10.1177/1541931218621471

  • Sakaguchi, K., Heilman, M., & Madnani, N. (2015). Effective feature integration for automated short answer scoring. In Proceedings of the 2015 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1049–1054). https://doi.org/10.3115/v1/N15-1111

  • Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

  • Shim, Y. (2013). The effects of online writing evaluation program. Teaching English with Technology, 13(3), 18–34.

  • Silverman, R. D., Coker, D., Proctor, C. P., Harring, J., Piantedosi, K. W., & Hartranft, A. M. (2015). The relationship between language skills and writing outcomes for linguistically diverse students in upper elementary school. The Elementary School Journal, 116(1), 103–125. https://doi.org/10.1086/683135

  • Steinberg, D., & Colla, P. (2009). CART: Classification and regression trees. The Top Ten Algorithms in Data Mining, 9, 179. https://doi.org/10.4135/9781412950589.n88

  • Stevenson, M., & Phakiti, A. (2014). The effects of computer-generated feedback on the quality of writing. Assessing Writing, 19, 51–65. https://doi.org/10.1016/j.asw.2013.11.007

  • Strobl, C., Ailhaud, E., Benetos, K., Devitt, A., Kruse, O., Proske, A., & Rapp, C. (2019). Digital support for academic writing: A review of technologies and pedagogies. Computers & Education, 131, 33–48. https://doi.org/10.1016/j.compedu.2018.12.005

  • Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. https://doi.org/10.18653/v1/D19-1514

  • Toyama, J., Misono, M., Suzuki, M., Nakayama, K., & Matsuo, Y. (2016). Neural machine translation with latent semantic of image and text. arXiv preprint arXiv:1611.08459. https://doi.org/10.48550/arXiv.1611.08459

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (pp. 1–11).

  • Wang, Y. (2021). Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(1), 1–25. https://doi.org/10.1145/3408317

  • Wang, E. L., Matsumura, L. C., Correnti, R., Litman, D., Zhang, H., Howe, E., … & Quintana, R. (2020). eRevis(ing): Students’ revision of text evidence use in an automated writing evaluation system. Assessing Writing, 44, 100449. https://doi.org/10.1016/j.asw.2020.100449

  • Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3(1), 22–36. https://doi.org/10.1080/15544800701771580

  • Wei, X., Zhang, T., Li, Y., Zhang, Y., & Wu, F. (2020). Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10941–10950). https://doi.org/10.1109/CVPR42600.2020.01095

  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x

  • Wilson, J., Ahrendt, C., Fudge, E. A., Raiche, A., Beard, G., & MacArthur, C. (2021). Elementary teachers’ perceptions of automated feedback and automated scoring: Transforming the teaching and learning of writing using automated writing evaluation. Computers & Education, 168, 104208. https://doi.org/10.1016/j.compedu.2021.104208

  • Wilson, J., & Czik, A. (2016). Automated essay evaluation software in English language arts classrooms: Effects on teacher feedback, student motivation, and writing quality. Computers & Education, 100, 94–109. https://doi.org/10.1016/j.compedu.2016.05.004

  • Wilson, J., & Roscoe, R. D. (2020). Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research, 58(1), 87–125. https://doi.org/10.1177/0735633119830764

  • Woodworth, J., & Barkaoui, K. (2020). Perspectives on using automated writing evaluation systems to provide written corrective feedback in the ESL classroom. TESL Canada Journal, 37(2), 234–247. https://doi.org/10.18806/tesl.v37i2.1340

  • Zhang, R., & Zou, D. (2021). A state-of-the-art review of the modes and effectiveness of multimedia input for second and foreign language learning. Computer Assisted Language Learning, ahead-of-print, 1–27. https://doi.org/10.1080/09588221.2021.1896555


Funding

This work was supported by the One-off Special Fund from Central and Faculty [grant number 02136] and the Start-Up Research Grant [grant number RG41/20-21R] of the Education University of Hong Kong; and the Youth Elite Supporting Plan in Universities of Anhui Province [grant number gxyqZD2019077], the Higher Education Teaching and Research Project of Anhui Province [grant number 2020jyxm0633], and the Science and Technology Plan Project in Chuzhou [grant number 2021ZD016].

Author information


Contributions

Ruibin Zhao: Conceptualization, methodology, validation, formal analysis, investigation, writing – original draft. Yipeng Zhuang: Methodology, software, formal analysis, visualization, writing – review & editing. Di Zou: Resources, investigation, writing – review & editing. Qin Xie: Conceptualization, investigation, writing – review & editing. Leung Ho Philip Yu: Supervision, writing – review & editing, project administration, funding acquisition.

Corresponding author

Correspondence to Philip L. H. Yu.

Ethics declarations

Conflict of Interest

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

All 15 pictures used in our writing test are shown below, grouped into three categories. Each picture in the first category includes one dominant character (e.g., a person or an animal); each picture in the second category covers two characters having some interaction; and each picture in the third category typically includes a group of people engaged in an activity.

[Figure: the 15 pictures used in the writing test, numbered 1–15 and grouped by category]

Note. The numbers are the IDs of the pictures in our research and this paper.

Appendix 2

Table 5 The scoring rubric designed for the picture-cued writing test in this study; all example responses were written by students for the 8th picture in Appendix 1

Appendix 3. Machine learning model training.

The scoring data are imbalanced: only a few responses received very low or very high scores. Severe imbalance can cause a trained model to simply ignore the classes with small sample sizes. Optimization tricks can make the model pay more attention to the classes with fewer samples, so that it learns their features rather than focusing too heavily on the classes with many samples. A simple approach is oversampling: directly copying minority-class samples to expand the data. Another method is SMOTE (Chawla et al., 2002). First, each sample \({x}_{i}\) is selected from the minority class as the root sample for synthesizing new samples; second, the k nearest neighbors of \({x}_{i}\) within the same class are used as references, and a new sample is generated by interpolating between \({x}_{i}\) and one of these neighbors; this process is repeated until the required number of samples is reached. Here, k is generally an odd number; we tried k = 5, 15, and 25 on the training set and found that k = 5 worked best. We compared the two sampling methods and, based on their performance on the training set, finally chose simple oversampling to balance the data.
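
To make the interpolation step concrete, the following is a minimal NumPy sketch of SMOTE-style synthesis as described above; the function name and arguments are our own illustration, not the authors' code.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Synthesize n_new minority-class samples by SMOTE-style interpolation.

    X_min: (n, d) array of minority-class feature vectors (requires n > k).
    Illustrative sketch only, not the implementation used in the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Squared pairwise distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    # Each sample's k nearest same-class neighbors (column 0 is the sample itself).
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(n)                    # pick a root sample x_i
        j = nn[i, rng.integers(nn.shape[1])]   # pick one of its k neighbors
        lam = rng.random()                     # interpolation coefficient in [0, 1)
        new_samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(new_samples)
```

For example, calling smote_oversample(X_low_scores, 200, k=5) would add 200 synthetic low-score responses; one would repeat this for each minority class until it matches the majority-class count.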

Next, we trained several machine learning models, using grid search with tenfold cross-validation on the training set to tune the hyperparameters of each model; a minimal sketch of this tuning loop follows the table below. The models are:

  • K-nearest neighbors (k-NN) (Coomans & Massart, 1982), which classifies a sample by finding its k most similar, nearest neighbors. Hyperparameter search space: {number of neighbors: 5, 10, 20, 50; weight function: "uniform", "distance"}.

  • Random forest (Breiman, 2001), which builds multiple decision trees and combines all their predictions for more robust behavior. Hyperparameter search space: {number of trees: 50, 100, 200, 400; split criterion: "gini", "entropy"; maximum tree depth: 4, 5, 6, 8, 10; minimum number of samples to split: 20, 40, 80, 100; number of features considered per split: "sqrt", "log2"}.

  • Support-vector machine (SVM) (Cortes & Vapnik, 1995), which maps training samples to points in space for classification and regression analysis. Hyperparameter search space: {regularization parameter: 1, 10, 100; kernel: "linear", "poly", "rbf", "sigmoid"; kernel coefficient gamma: 1e-2, 1e-3, 1e-4, "auto"}.

  • XGBoost (Chen & Guestrin, 2016), which implements tree-based machine learning algorithms under the gradient boosting framework. Hyperparameter search space: {learning rate: 0.05, 0.1, 0.2, 0.3; maximum depth: 4, 6, 8; subsample: 0.6, 0.8, 0.9, 1; scale_pos_weight: 1, 5, 10; alpha: 0, 1, 2, 5, 10}.

For each component, the model with the least MAE on the tenfold cross-validation was selected. We first trained models to predict the six rubric components; their performance is listed in the following table. Adding the component scores yields the final score, which has a total MAE of 0.479 against the human scores.

Component            Score range   MAE
Grammar              [0, 3]        0.230
Spelling             [0, 1]        0.056
Convention           [0, 1]        0.054
Comprehensiveness    [0, 3]        0.198
Vividness            [0, 1]        0.083
Sentence structure   [0, 1]        0.018
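
As referenced above, here is a minimal scikit-learn sketch of the grid search with tenfold cross-validation, shown for the random forest component models. The data and variable names (X_train, y_train) are placeholders of ours; the real features are the cross-modal similarities and NLP indices described in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the response features and one
# component's human scores (e.g., Grammar, an integer in [0, 3]).
rng = np.random.default_rng(0)
X_train = rng.random((400, 12))
y_train = rng.integers(0, 4, size=400)

# Search space matching the random forest grid listed above.
# (The full grid is large; shrink it to run quickly.)
param_grid = {
    "n_estimators": [50, 100, 200, 400],
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 5, 6, 8, 10],
    "min_samples_split": [20, 40, 80, 100],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,                               # tenfold cross-validation
    scoring="neg_mean_absolute_error",   # select the model with least MAE
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, "CV MAE:", -search.best_score_)
```

The same loop would be repeated per component (swapping in k-NN, SVM, or XGBoost with their respective grids), keeping whichever model minimizes the cross-validated MAE.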

Appendix 4

Main indices used in our automated scoring model. In the figure, the values represent the Gini importance (Steinberg & Colla, 2009) of the indices in evaluating the grammar, spelling, convention, comprehensiveness, vividness, and sentence structure of a student response, as well as in predicting the final score for the response. The larger the value, the more important the index.

[Figure: Gini importance of the main indices used in the automated scoring model, per component and for the final score]

The six variables in the upper part are the features generated by three cross-modal matching methods (i.e., ALBEF, GSMN, and SGRAF), and they are mainly used in predicting "Comprehensiveness". Each method estimated a set of similarities with models trained on different datasets or with multiple parameter scales, and we chose the two most effective similarities in our experiments. The variables in the lower part are the features generated by natural language processing. The graph shows the importance of the variables by which the machine predicts the scores, and they all fit our conjectures: for example, the vividness of a sentence mainly depends on whether many adjectives and adverbs are used, and when scoring sentence structure, more attention is paid to the number of clauses and pronouns.
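
As a reference for how such importances are obtained, the Gini importance of each feature can be read directly from a fitted scikit-learn tree ensemble via its feature_importances_ attribute. The sketch below uses placeholder data, and the feature names are hypothetical stand-ins for the cross-modal similarities and NLP indices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for the indices in the figure.
feature_names = ["albef_sim_1", "albef_sim_2", "gsmn_sim_1", "gsmn_sim_2",
                 "sgraf_sim_1", "sgraf_sim_2", "n_adjectives", "n_adverbs",
                 "n_clauses", "n_pronouns", "n_spelling_errors", "n_tokens"]
rng = np.random.default_rng(0)
X = rng.random((400, len(feature_names)))
y = rng.integers(0, 4, size=400)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# feature_importances_ is the normalized Gini importance: the total
# impurity decrease attributed to each feature, averaged over all trees.
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:20s} {imp:.3f}")
```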

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, R., Zhuang, Y., Zou, D. et al. AI-assisted automated scoring of picture-cued writing tasks for language assessment. Educ Inf Technol 28, 7031–7063 (2023). https://doi.org/10.1007/s10639-022-11473-y
