Abstract
Grading assignments is inherently subjective and time-consuming; automatic scoring tools can greatly reduce teacher workload and shorten the time needed to provide feedback to learners. The purpose of this paper is to propose a novel method for automatically scoring student responses to picture-cued writing tasks. As a popular paradigm for language instruction and assessment, a picture-cued writing task typically requires students to describe one or more pictures; accordingly, an automatic scoring method must measure the links between the pictures and their textual descriptions. For this purpose, we first designed a picture-cued writing test and collected nearly 4,000 responses from 279 K–12 students. Based on these responses, we then developed an AI scoring model that combines emerging cross-modal matching technology with several NLP algorithms. The performance of the model was evaluated carefully with six popular measures and produced accurate scoring results, with a small mean absolute error of 0.479 and a high adjacent-agreement rate of 90.64%. We believe this method could reduce the subjectivity inherent in human grading and free teachers' time from the mundane task of grading for more valuable endeavors, such as designing teaching plans based on AI-generated diagnoses of student progress.
References
Aschawir, A. (2014). Using series pictures to develop the students’ ideas in English narrative writing. Scholarly Journal of Education, 3(7), 88–95.
Asrifan, A. (2015). The use of pictures story in improving students' ability to write narrative composition. International Journal of Language and Linguistics, 3(4), 244–251. https://doi.org/10.11648/j.ijll.20150304.18
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V. 2. The Journal of Technology, Learning and Assessment, 4(3), 1–31.
Baird, C., & Dooey, P. (2017). Using images to facilitate writing for skills assessment: A visual PELA. The Australian Journal of Indigenous Education, 46(2), 160–172. https://doi.org/10.1017/jie.2016.32
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27–40. https://doi.org/10.1080/08957347.2012.635502
Chapelle, C. A., Cotos, E., & Lee, J. (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385–405. https://doi.org/10.1177/0265532214565386
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., & Han, J. (2020). IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12655–12663). https://doi.org/10.1109/CVPR42600.2020.01267
Chen, F., Zhang, D., Han, M., Chen, X., Shi, J., Xu, S., & Xu, B. (2022). VLP: A survey on vision-language pre-training. arXiv preprint arXiv:2202.09061. https://doi.org/10.48550/arXiv.2202.09061
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
Deeva, G., Bogdanova, D., Serral, E., Snoeck, M., & De Weerdt, J. (2021). A review of automated feedback systems for learners: Classification framework, challenges and opportunities. Computers & Education, 162, 104094. https://doi.org/10.1016/j.compedu.2020.104094
Diao, H., Zhang, Y., Ma, L., & Lu, H. (2021). Similarity reasoning and filtration for image-text matching. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (Vol. 35, No. 2, pp. 1218–1226).
Elliott, S., Shermis, M. D., & Burstein, J. (2003). Overview of IntelliMetric. In M. D. Shermis and J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 67–70). Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410606860
Erfanian Mohammadi, J., Elahi Shirvan, M., & Akbari, O. (2019). Systemic functional multimodal discourse analysis of teaching students developing classroom materials. Teaching in Higher Education, 24(8), 964–986. https://doi.org/10.1080/13562517.2018.1527763
Gabeur, V., Sun, C., Alahari, K., & Schmid, C. (2020). Multi-modal transformer for video retrieval. In Proceedings of European Conference on Computer Vision (pp. 214–229).
Haman, E., Łuniewska, M., & Pomiechowska, B. (2015). Designing cross-linguistic lexical tasks (CLTs) for bilingual preschool children. In S. Armon-Lotem, J. de Jong, & N. Meir (Eds.), Methods for assessing multilingual children: Disentangling bilingualism from Language impairment (pp. 194–238). Multilingual Matters. https://doi.org/10.21832/9781783093137-010
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.
Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6), 1–36. https://doi.org/10.1145/3295748
James, K. H., Vinci-Booher, S., & Munoz-Rubke, F. (2017). The impact of multimodal-multisensory learning on human performance and brain activation patterns. In S. Oviatt, B. Schuller, P. Cohen, D. Sonntag, G. Potamianos, & A. Kruger (Eds.), The handbook of multimodal-multisensor interfaces, vol. 1: Foundations, user modeling, and common modality combinations (pp. 51–94). Morgan & Claypool Publishers. https://doi.org/10.1145/3015783.3015787
Jin, C., Zhang, T., Liu, S., Tie, Y., Lv, X., Li, J., & Yang, Z. (2021). Cross-modal deep learning applications: audio-visual retrieval. In Proceedings of International Conference on Pattern Recognition (pp. 301–313).
Kharkhurin, A. V. (2012). A preliminary version of an internet-based picture naming test. Open Journal of Modern Linguistics, 2(01), 34–41. https://doi.org/10.4236/ojml.2012.21005
Khoii, R., & Doroudian, A. (2014). Automated scoring of EFL learners' written performance: a torture or a blessing. In Proceedings of Conference on ICT for Language Learning (pp. 5146–5155).
Kingston, N., & Nash, B. (2011). Formative assessment: A meta-analysis and a call for research. Educational Measurement: Issues and Practice, 30(4), 28–37. https://doi.org/10.1111/j.1745-3992.2011.00220.x
Lee, K. H., Chen, X., Hua, G., Hu, H., & He, X. (2018). Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (pp. 201–216). https://doi.org/10.1007/978-3-030-01225-0_13
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., & Hoi, S. C. H. (2021). Align before fuse: Vision and language representation learning with momentum distillation. In Proceedings of the 35th Conference on Neural Information Processing Systems (pp. 1978–1992).
Link, S., Mehrzad, M., & Rahimi, M. (2020). Impact of automated writing evaluation on teacher feedback, student revision, and writing improvement. Computer Assisted Language Learning, 35(4), 605–634. https://doi.org/10.1080/09588221.2020.1743323
Listyani, L. (2019). The use of a visual image to promote narrative writing ability and creativity. Eurasian Journal of Educational Research, 80, 193–224.
Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., & Zhang, Y. (2020). Graph structured network for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10921–10930). https://doi.org/10.1109/CVPR42600.2020.01093
Liu, J., Xu, Y., & Zhu, Y. (2019). Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744. https://doi.org/10.48550/arXiv.1901.07744
Lu, C., & Cutumisu, M. (2021). Integrating deep learning into an automated feedback generation system for automated essay scoring. In Proceedings of the 14th International Conference on Educational Data Mining (pp. 573–579).
Malali, N., & Keller, Y. (2021). Learning to embed semantic similarity for joint image-text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2021.3132163
Mangaroska, K., Martinez-Maldonado, R., Vesin, B., & Gašević, D. (2021). Challenges and opportunities of multimodal data in human learning: The computer science students’ perspective. Journal of Computer Assisted Learning, 37(4), 1030–1047. https://doi.org/10.1111/jcal.12542
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55–60). https://doi.org/10.3115/v1/P14-5010
McCarthy, K. S., Roscoe, R. D., Allen, L. K., Likens, A. D., & McNamara, D. S. (2022). Automated writing evaluation: Does spelling and grammar feedback support high-quality writing and revision? Assessing Writing, 52, 100608. https://doi.org/10.1016/j.asw.2022.100608
Page, E. B. (2003). Project essay grade: PEG. In M. D. Shermis and J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43–54). Lawrence Erlbaum Associates. https://doi.org/10.4324/9781410606860-12
Paivio, A. (1991). Dual coding theory: Retrospect and current status. Canadian Journal of Psychology/Revue Canadienne de Psychologie, 45(3), 255. https://doi.org/10.1037/h0084295
Palermo, C., & Thomson, M. M. (2018). Teacher implementation of self-regulated strategy development with an automated writing evaluation system: Effects on the argumentative writing performance of middle school students. Contemporary Educational Psychology, 54, 255–270. https://doi.org/10.1016/j.cedpsych.2018.07.002
Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39. https://doi.org/10.1016/j.asw.2012.10.004
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Roscoe, R. D., Allen, L. K., Johnson, A. C., & McNamara, D. S. (2018). Automated writing instruction and feedback: Instructional mode, attitudes, and revising. In Proceedings of the 62nd Annual Meeting of the Human Factors and Ergonomics Society (pp. 2089–2093). Human Factors & Ergonomics Society. https://doi.org/10.1177/1541931218621471
Sakaguchi, K., Heilman, M., & Madnani, N. (2015). Effective feature integration for automated short answer scoring. In Proceedings of the 2015 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1049–1054). https://doi.org/10.3115/v1/N15-1111
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
Shim, Y. (2013). The effects of online writing evaluation program. Teaching English with Technology, 13(3), 18–34.
Silverman, R. D., Coker, D., Proctor, C. P., Harring, J., Piantedosi, K. W., & Hartranft, A. M. (2015). The relationship between language skills and writing outcomes for linguistically diverse students in upper elementary school. The Elementary School Journal, 116(1), 103–125. https://doi.org/10.1086/683135
Steinberg, D., & Colla, P. (2009). CART: Classification and regression trees. The Top Ten Algorithms in Data Mining, 9, 179. https://doi.org/10.4135/9781412950589.n88
Stevenson, M., & Phakiti, A. (2014). The effects of computer-generated feedback on the quality of writing. Assessing Writing, 19, 51–65. https://doi.org/10.1016/j.asw.2013.11.007
Strobl, C., Ailhaud, E., Benetos, K., Devitt, A., Kruse, O., Proske, A., & Rapp, C. (2019). Digital support for academic writing: A review of technologies and pedagogies. Computers & Education, 131, 33–48. https://doi.org/10.1016/j.compedu.2018.12.005
Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. https://doi.org/10.18653/v1/D19-1514
Toyama, J., Misono, M., Suzuki, M., Nakayama, K., & Matsuo, Y. (2016). Neural machine translation with latent semantic of image and text. arXiv preprint arXiv:1611.08459. https://doi.org/10.48550/arXiv.1611.08459
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (pp. 1–11).
Wang, Y. (2021). Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(1), 1–25. https://doi.org/10.1145/3408317
Wang, E. L., Matsumura, L. C., Correnti, R., Litman, D., Zhang, H., Howe, E., … & Quintana, R. (2020). eRevis(ing): Students' revision of text evidence use in an automated writing evaluation system. Assessing Writing, 44, 100449. https://doi.org/10.1016/j.asw.2020.100449
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3(1), 22–36. https://doi.org/10.1080/15544800701771580
Wei, X., Zhang, T., Li, Y., Zhang, Y., & Wu, F. (2020). Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10941–10950). https://doi.org/10.1109/CVPR42600.2020.01095
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
Wilson, J., Ahrendt, C., Fudge, E. A., Raiche, A., Beard, G., & MacArthur, C. (2021). Elementary teachers’ perceptions of automated feedback and automated scoring: Transforming the teaching and learning of writing using automated writing evaluation. Computers & Education, 168, 104208. https://doi.org/10.1016/j.compedu.2021.104208
Wilson, J., & Czik, A. (2016). Automated essay evaluation software in English language arts classrooms: Effects on teacher feedback, student motivation, and writing quality. Computers & Education, 100, 94–109. https://doi.org/10.1016/j.compedu.2016.05.004
Wilson, J., & Roscoe, R. D. (2020). Automated writing evaluation and feedback: Multiple metrics of efficacy. Journal of Educational Computing Research, 58(1), 87–125. https://doi.org/10.1177/0735633119830764
Woodworth, J., & Barkaoui, K. (2020). Perspectives on using automated writing evaluation systems to provide written corrective feedback in the ESL classroom. TESL Canada Journal, 37(2), 234–247. https://doi.org/10.18806/tesl.v37i2.1340
Zhang, R., & Zou, D. (2021). A state-of-the-art review of the modes and effectiveness of multimedia input for second and foreign language learning. Computer Assisted Language Learning, ahead-of-print, 1–27. https://doi.org/10.1080/09588221.2021.1896555
Funding
This work was supported by the One-off Special Fund from Central and Faculty [grant number 02136] and the Start-Up Research Grant [grant number RG41/20-21R] of the Education University of Hong Kong; and the Youth Elite Supporting Plan in Universities of Anhui Province [grant number gxyqZD2019077], the Higher Education Teaching and Research Project of Anhui Province [grant number 2020jyxm0633], and the Science and Technology Plan Project in Chuzhou [grant number 2021ZD016].
Author information
Contributions
Ruibin Zhao: Conceptualization, methodology, validation, formal analysis, investigation, writing – original draft. Yipeng Zhuang: Methodology, software, formal analysis, visualization, writing – review & editing. Di Zou: Resources, investigation, writing – review & editing. Qin Xie: Conceptualization, investigation, writing – review & editing. Leung Ho Philip Yu: Supervision, writing – review & editing, project administration, funding acquisition.
Ethics declarations
Conflict of Interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1
All 15 pictures were used in our writing test, grouped into three categories. Each picture in the first category includes one dominant character, e.g., a person or an animal; each picture in the second category shows two characters interacting; and each picture in the third category typically includes a group of people engaged in an activity.
Note. The numbers are the IDs of the pictures in our research and in this paper.
Appendix 2
Table 5
Appendix 3. Machine learning model training.
A key issue is imbalanced data: only a few responses were assigned very low or very high scores. Severe imbalance can cause a trained model to simply ignore the categories with few samples. Optimization tricks can make the model pay more attention to the under-represented classes, so that it learns their features rather than focusing on the categories with many samples. A simple approach is oversampling: directly copying existing samples to enlarge the minority classes. Another method is SMOTE. First, each sample \({x}_{i}\) in the minority class is selected as a root sample for synthesizing new samples; second, the k nearest same-class neighbors of \({x}_{i}\) serve as references, and new samples are generated by interpolating between \({x}_{i}\) and these neighbors; this process is repeated until the required number of samples is reached. Here, k is usually an odd number; we tried k = 5, 15, and 25 on the training set and found that k = 5 worked best. We compared the two sampling methods and ultimately chose oversampling for balancing the data, based on their performance on the training set.
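For illustration, the following is a minimal sketch of the two balancing strategies using the imbalanced-learn library; the feature matrix `X` and score labels `y` are hypothetical placeholders, not the study's actual data.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Hypothetical training data: 200 responses with 10 features each, and
# component scores where the extreme classes are rare (imbalanced).
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.choice([0, 1, 2, 3], size=200, p=[0.05, 0.45, 0.45, 0.05])

# Simple oversampling: duplicate minority-class samples until classes balance.
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)

# SMOTE: synthesize new minority samples by interpolating between each root
# sample and its k nearest same-class neighbors (k = 5 worked best here).
smote = SMOTE(k_neighbors=5, random_state=42)
X_sm, y_sm = smote.fit_resample(X, y)
```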
Next, we trained several machine learning models, using grid search with tenfold cross-validation on the training set to tune the hyperparameters of each model. The first is the k-nearest neighbors algorithm (k-NN) (Coomans & Massart, 1982), which classifies a sample by finding its k most similar neighbors; its hyperparameter search space is {number of neighbors: 5, 10, 20, 50; weight function: "uniform", "distance"}. The second is Random Forest (Breiman, 2001), which builds multiple decision trees and combines their predictions for more robust behavior; its search space is {number of trees: 50, 100, 200, 400; criterion: "gini", "entropy"; maximum tree depth: 4, 5, 6, 8, 10; minimum number of samples to split: 20, 40, 80, 100; number of features considered at each split: "sqrt", "log2"}. We also considered support-vector machines (SVM) (Cortes & Vapnik, 1995), which map training samples to points in space for classification and regression analysis; the search space is {regularization parameter: 1, 10, 100; kernel: "linear", "poly", "rbf", "sigmoid"; kernel coefficient gamma: 1e-2, 1e-3, 1e-4, "auto"}. The last is XGBoost (Chen & Guestrin, 2016), which implements machine learning algorithms under the gradient boosting framework; the search space is {learning rate: 0.05, 0.1, 0.2, 0.3; maximum depth: 4, 6, 8; subsample: 0.6, 0.8, 0.9, 1; scale_pos_weight: 1, 5, 10; alpha: 0, 1, 2, 5, 10}. For each score component, we selected the model with the smallest MAE on the tenfold cross-validation. We first trained models to predict the six components; their performance is listed in the table below, and a grid-search sketch follows the table. Adding the component scores yields the final score, whose total MAE against the human scores is 0.479.
| Component | MAE |
|---|---|
| Grammar \(\in \left[0, 3\right]\) | 0.230 |
| Spelling \(\in \left[0, 1\right]\) | 0.056 |
| Convention \(\in \left[0, 1\right]\) | 0.054 |
| Comprehensiveness \(\in \left[0, 3\right]\) | 0.198 |
| Vividness \(\in \left[0, 1\right]\) | 0.083 |
| Sentence structure \(\in \left[0, 1\right]\) | 0.018 |
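As an illustration of the tuning procedure, here is a minimal sketch using scikit-learn's GridSearchCV with tenfold cross-validation; the data, and the SVM grid shown, are hypothetical stand-ins for the full search spaces described above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical balanced training data for one score component.
rng = np.random.default_rng(0)
X = rng.random((400, 10))
y = rng.choice([0, 1, 2, 3], size=400)

# A reduced version of the SVM search space described above.
param_grid = {
    "C": [1, 10, 100],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": [1e-2, 1e-3, 1e-4, "auto"],
}

# Tenfold CV, selecting the configuration with the smallest MAE
# (scikit-learn maximizes its score, so MAE is negated).
search = GridSearchCV(SVC(), param_grid, cv=10,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```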
Appendix 4
Main indices used in our automated scoring model. In the figure, the values represent the Gini importance (Steinberg & Colla, 2009) of the indices in evaluating the grammar, spelling, convention, comprehensiveness, vividness, and sentence structure of student responses, as well as in predicting the final score. The larger the value, the more important the index.
The six variables in the upper part are features generated by three cross-modal matching methods (i.e., ALBEF, GSMN, and SGRAF), and they are mainly used in predicting "Comprehensiveness". Each method estimated a set of similarities with models trained on different datasets or with multiple scales of parameters, and we chose the two most effective similarities in our experiments. The variables in the lower part are features generated by natural language processing. The graph shows the importance of the variables by which the machine predicts the scores, and they all fit our conjectures: for example, the vividness of a sentence depends mainly on whether many adjectives and adverbs are used, and when scoring sentence structure, more attention is paid to the numbers of clauses and pronouns.
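To illustrate how such Gini importances can be obtained, the sketch below reads feature importances from a trained scikit-learn random forest, whose `feature_importances_` attribute is the mean decrease in Gini impurity; the feature names are hypothetical examples in the spirit of the indices above, not the study's exact variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: cross-modal similarities plus NLP-derived counts.
feature_names = ["albef_sim", "gsmn_sim", "sgraf_sim",
                 "n_adjectives", "n_adverbs", "n_clauses", "n_pronouns"]
rng = np.random.default_rng(0)
X = rng.random((400, len(feature_names)))
y = rng.choice([0, 1, 2, 3], size=400)

forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# feature_importances_ holds each index's Gini importance
# (mean decrease in impurity across all trees).
for name, imp in sorted(zip(feature_names, forest.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```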
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, R., Zhuang, Y., Zou, D. et al. AI-assisted automated scoring of picture-cued writing tasks for language assessment. Educ Inf Technol 28, 7031–7063 (2023). https://doi.org/10.1007/s10639-022-11473-y