Abstract
Accurately interpreting student responses is a critical requirement of dialog-based intelligent tutoring systems. The accuracy of supervised learning methods used for interpreting or analyzing student responses depends strongly on the availability of annotated training data. Collecting and grading student responses is tedious, time-consuming, and expensive. This work proposes an iterative data collection and grading approach. We show that data collection effort can be significantly reduced by predicting question difficulty and by collecting answers from a focused set of students. Further, grading effort can be reduced by filtering out student answers that are unlikely to help in training the Student Response Analyzer (SRA). To ensure the quality of grades, we analyze grader characteristics and show an improvement when a biased grader is removed. An experimental evaluation on a large-scale dataset shows a reduction of up to 28% in data collection cost and up to 10% in grading cost, while improving the macro-average F1 of response analysis.
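The abstract describes the approach only at a high level. As a reading aid, the following minimal Python sketch (an editorial illustration, not the authors' implementation) shows how such an iterative collect-filter-grade loop could be wired together; the function names predict_difficulty, is_informative, collect_answers, and grade, the random difficulty model, and the difficulty thresholds are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' implementation) of an iterative
# collect-filter-grade loop of the kind described in the abstract.
# All function names, thresholds, and the random difficulty model are
# illustrative assumptions.
import random


def predict_difficulty(question):
    # Placeholder difficulty predictor; a real system might use an IRT
    # model or a text-based regressor. Returns a score in [0, 1].
    return random.random()


def is_informative(answer, already_graded):
    # Placeholder filter: skip answers identical to ones already graded.
    # A real filter might use textual similarity or model uncertainty.
    return answer not in already_graded


def iterative_collection(questions, collect_answers, grade,
                         rounds=3, low=0.3, high=0.7):
    """Collect and grade answers only where they are expected to help."""
    training_data = []  # list of (answer, grade) pairs
    for _ in range(rounds):
        # Focus collection on mid-difficulty questions, where answers are
        # assumed to be most informative for training the analyzer.
        focus = [q for q in questions
                 if low <= predict_difficulty(q) <= high]
        for question in focus:
            answers = collect_answers(question)
            graded_texts = {a for a, _ in training_data}
            kept = [a for a in answers if is_informative(a, graded_texts)]
            training_data.extend((a, grade(question, a)) for a in kept)
    return training_data


if __name__ == "__main__":
    # Toy usage with stub collection and grading callbacks.
    questions = ["Why does the bulb light up?", "What is electric current?"]
    data = iterative_collection(
        questions,
        collect_answers=lambda q: [f"student answer to: {q}"],
        grade=lambda q, a: "correct",
        rounds=1,
    )
    print(len(data), "graded answers collected")
```

Restricting collection to mid-difficulty questions and filtering near-duplicate answers before grading mirrors the two cost levers named in the abstract; the concrete difficulty model and filter would, of course, be specific to the actual SRA pipeline.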
Notes
- 1.
- 2. For simplicity, we assume that the costs of question creation, answer collection, and answer grading are uniform across questions and answers.
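Under this uniform-cost assumption, total annotation effort is simply linear in the number of questions created, answers collected, and answers graded. The short Python sketch below illustrates the bookkeeping; the unit costs and counts are hypothetical, and the 28% and 10% reductions are the figures reported in the abstract.

```python
# Illustrative bookkeeping under the uniform-cost assumption above: every
# question costs the same to create, and every answer costs the same to
# collect and to grade. The unit costs below are hypothetical.
def total_cost(n_questions, n_collected, n_graded,
               c_create=1.0, c_collect=0.2, c_grade=0.5):
    return (n_questions * c_create
            + n_collected * c_collect
            + n_graded * c_grade)


# Example: cutting answer collection by 28% and grading by 10% (the
# reductions reported in the abstract) lowers the total annotation cost.
baseline = total_cost(100, 5000, 5000)
reduced = total_cost(100, 5000 * 0.72, 5000 * 0.90)
print(f"baseline: {baseline:.0f}, reduced: {reduced:.0f}")
```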
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Dhamecha, T.I., Marvaniya, S., Saha, S., Sindhgatta, R., Sengupta, B. (2018). Balancing Human Efforts and Performance of Student Response Analyzer in Dialog-Based Tutors. In: Penstein Rosé, C., et al. (eds.) Artificial Intelligence in Education. AIED 2018. Lecture Notes in Computer Science, vol. 10947. Springer, Cham. https://doi.org/10.1007/978-3-319-93843-1_6
DOI: https://doi.org/10.1007/978-3-319-93843-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93842-4
Online ISBN: 978-3-319-93843-1
eBook Packages: Computer Science, Computer Science (R0)