Abstract
Worldwide, open-ended questions that require short answers are used in many science assessments, such as the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS). In Turkey, however, national-level examinations, especially the high school and university entrance exams, rely on multiple-choice questions. This study aims to develop an objective and practical automatic scoring model for open-ended questions using machine learning algorithms. To this end, an automated scoring model was constructed for four physics questions from a university-level course with the participation of 246 undergraduate students. The short-answer scoring approach was designed to handle student responses written in Turkish. After data preprocessing, machine learning classification techniques including SVM (Support Vector Machines), Gini decision trees, KNN (k-Nearest Neighbors), Bagging, and Boosting were applied. Each predictive model was evaluated in terms of accuracy, precision, and F1-score, and the AdaBoost.M1 technique showed the best performance. In this paper, we report on a short-answer grading system for Turkish, based on a machine learning approach and a dataset constructed from a physics course taught in Turkish. To our knowledge, this is also the first study on automatic scoring of open-ended exam questions in Turkish.
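A minimal sketch of how such a pipeline might look in R, assuming the preprocessed answers for one question are already available as a term-frequency data frame with a human-assigned score column. The file name, split ratio, and boosting settings below are illustrative assumptions, not the authors' configuration; the adabag package supplies the AdaBoost.M1 implementation.

    ## Hypothetical sketch, not the authors' code.
    library(adabag)   # provides AdaBoost.M1 via boosting()

    # Assumed input: term-frequency features plus a factor column 'score'
    # holding the human-assigned grade for each student answer.
    answers <- read.csv("physics_q1_features.csv")
    answers$score <- as.factor(answers$score)

    # Hold out a test set (70/30 split, illustrative).
    set.seed(42)
    train_idx <- sample(nrow(answers), size = round(0.7 * nrow(answers)))
    train <- answers[train_idx, ]
    test  <- answers[-train_idx, ]

    # Fit AdaBoost.M1 with decision-tree base learners.
    model <- boosting(score ~ ., data = train, mfinal = 50, coeflearn = "Freund")

    # Score the held-out answers and compare with the human grades.
    pred <- predict(model, newdata = test)
    print(pred$confusion)                   # model vs. human score labels
    print(mean(pred$class == test$score))   # accuracy

The same train/test split could be reused to compare SVM, decision-tree, KNN, and Bagging models on identical data before selecting the best performer.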


Highlights
• The AdaBoost.M1 algorithm achieved very high scoring performance on four physics questions that differed considerably in content and difficulty.
• When machine learning algorithms score open-ended questions, the system imitates the field expert; in this research, the model was constructed with the methods that agreed most closely with human scoring (see the sketch below).
• If open-ended questions are introduced into Turkey's national-level selection and placement examinations, the AdaBoost.M1 technique is expected to score them successfully.
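As a complementary illustration (again hypothetical, not the authors' code), agreement between the model and the human grader can be summarised with the metrics named in the abstract. The helper below computes accuracy and macro-averaged precision, recall, and F1 from two vectors of grade labels.

    # Hypothetical helper: 'human' and 'predicted' are vectors of grade labels.
    macro_metrics <- function(human, predicted) {
      lv <- union(levels(factor(human)), levels(factor(predicted)))
      cm <- table(factor(human, lv), factor(predicted, lv))  # rows: human, cols: model
      precision <- diag(cm) / pmax(colSums(cm), 1)            # per-class precision
      recall    <- diag(cm) / pmax(rowSums(cm), 1)            # per-class recall
      f1 <- ifelse(precision + recall == 0, 0,
                   2 * precision * recall / (precision + recall))
      list(accuracy  = sum(diag(cm)) / sum(cm),
           precision = mean(precision),
           recall    = mean(recall),
           f1        = mean(f1))
    }

    # Example use with the objects from the earlier sketch:
    # macro_metrics(test$score, pred$class)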
Electronic supplementary material
ESM 1 (DOCX 60 kb)
Cite this article
Çınar, A., Ince, E., Gezer, M. et al. Machine learning algorithm for grading open-ended physics questions in Turkish. Educ Inf Technol 25, 3821–3844 (2020). https://doi.org/10.1007/s10639-020-10128-0