ABSTRACT
Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained separately for every prompt or question, which in turn requires graded samples. In automatic programming evaluation, the focus of this work, the issue is amplified: models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability of experts and the time they need to create a labeled set of programs for each question are a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs that requires no manually labeled samples for grading responses to a new, unseen question. We extend our previous work [25], in which we introduced a grammar of features to learn question-specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features, we learn one supervised model across questions for a given language, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both question-specific models and the consensus among human experts, while substantially outperforming extant ways of evaluating code. We demonstrate the system's value by deploying it to grade programs in a high-stakes assessment. The learning from this work is transferable to other grading tasks, such as math question grading, and also provides a new variation on the supervised learning approach.
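The idea of one model shared across questions can be illustrated with a minimal sketch. The abstract does not specify the feature transformation, so the per-question z-scoring below is a stand-in chosen purely for illustration, not the authors' method. The feature values, question names, grades, and helper functions are all hypothetical.

```python
from statistics import mean, pstdev


def standardize_within_question(values):
    """Z-score raw feature values using only the responses to one question.

    Assumption for illustration: rescaling within a question makes a feature
    comparable across questions. No grades are needed for this step.
    """
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero spread
    return [(v - mu) / sigma for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical graded questions: the raw feature lives on a different scale
# in each question, while grades share a 1-5 rubric.
graded = {
    "q1": ([2, 4, 6, 8, 10], [1, 2, 3, 4, 5]),
    "q2": ([20, 40, 60, 80, 100], [1, 2, 3, 4, 5]),
}

# Pool the transformed features from all graded questions into one training set.
xs, ys = [], []
for raw_values, grades in graded.values():
    xs.extend(standardize_within_question(raw_values))
    ys.extend(grades)

slope, intercept = fit_line(xs, ys)

# Unseen question: ungraded responses only. The same transformation is applied,
# then the shared model predicts grades without any labels for this question.
unseen_raw = [5, 15, 25, 35, 45]
unseen_z = standardize_within_question(unseen_raw)
predicted = [slope * z + intercept for z in unseen_z]
print([round(p, 2) for p in predicted])
```

In this toy setup the raw feature scales differ by question, so a model fit on raw values would not carry over to the unseen question, whereas the transformed feature keeps the same relation to the grade in every question.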
References
- Automata. Aspiring Minds http://www.aspiringminds.com/technology/automata.
- E-rater. ETS http://www.ets.org/research/topics/as_nlp/writing_quality/.
- Intelli metric. Vantage Learning http://www.vantagelearning.com/products/intellimetric/.
- Speechrater. ETS https://www.ets.org/research/topics/as_nlp/speech/.
- Svar. Aspiring Minds http://www.aspiringminds.com/technology/svar.
- V. Aggarwal, S. Srikant, and V. Shashidhar. Principles for using machine learning in the assessment of open response items: Programming assessment as a case study. In NIPS Workshop on Data Driven Education, 2013.
- J. Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7--39, 1997.
- J. Bernstein, A. Van Moere, and J. Cheng. Validating automated speaking tests. Language Testing, 2010.
- M. Birenbaum and K. K. Tatsuoka. Open-ended versus multiple-choice response formats-it does make a difference for diagnostic purposes. Applied Psychological Measurement, 11(4):385--395, 1987.
- H. M. Breland. The direct assessment of writing skill: A measurement review. ETS Research Report Series, 1983(2):i--23, 1983.
- J. Burstein, L. Braden-Harder, M. Chodorow, S. Hua, B. Kaplan, K. Kukich, C. Lu, J. Nolan, D. Rock, and S. Wolff. Computer analysis of essay content for automated score prediction: A prototype automated scoring system for GMAT analytical writing assessment essays. ETS Research Report Series, 1998(1):i--67, 1998.
- C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
- H. Daume III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, pages 101--126, 2006.
- E. L. Glassman, J. Scott, R. Singh, P. J. Guo, and R. C. Miller. Overcode: Visualizing variation in student solutions to programming problems at scale. ACM Transactions on Computer-Human Interaction (TOCHI), 22(2):7, 2015.
- J. Huang, C. Piech, A. Nguyen, and L. Guibas. Syntactic and functional variability of a million code submissions in a machine learning MOOC. In AIED 2013 Workshops Proceedings Volume, page 25. Citeseer, 2013.
- A. S. Lan, D. Vats, A. E. Waters, and R. G. Baraniuk. Mathematical language processing: Automatic grading and feedback for open response mathematical questions. In Proceedings of the Second (2015) ACM Conference on Learning@ Scale, pages 167--176. ACM, 2015.
- N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417--473, 2010.
- L. Pappano. The year of the MOOC. The New York Times (Accessed: 2016--2--2).
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825--2830, 2011.
- K. Rivers and K. R. Koedinger. Automatic generation of programming feedback: A data-driven approach. In The First Workshop on AI-supported Education for Computer Science (AIEDCS 2013), page 50, 2013.
- V. Shashidhar, N. Pandey, and V. Aggarwal. Automatic spontaneous speech grading: A novel feature derivation technique using the crowd. In Proceedings of the Conference of the Association for Computational Linguistics. ACL, 2015.
- V. Shashidhar, N. Pandey, and V. Aggarwal. Spoken English grading: Machine learning with crowd intelligence. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2089--2097. ACM, 2015.
- R. Singh, S. Gulwani, and A. Solar-Lezama. Automated feedback generation for introductory programming assignments. In ACM SIGPLAN Notices, volume 48, pages 15--26. ACM, 2013.
- V. Southavilay, K. Yacef, P. Reimann, and R. A. Calvo. Analysis of collaborative writing processes using revision maps and probabilistic topic models. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, pages 38--47. ACM, 2013.
- S. Srikant and V. Aggarwal. A system to grade computer programming skills using machine learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1887--1896. ACM, 2014.
- S. Thrun. Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems, pages 640--646, 1996.
- C. Vleuten, G. Norman, and E. Graaff. Pitfalls in the pursuit of objectivity: issues of reliability. Medical education, 25(2):110--118, 1991.
Index Terms
- Question Independent Grading using Machine Learning: The Case of Computer Program Grading