
Developing, evaluating, and refining an automatic generator of diagnostic multiple choice cloze questions to assess children's comprehension while reading*

Published online by Cambridge University Press:  14 April 2016

JACK MOSTOW
Affiliation:
Project LISTEN, School of Computer Science, Carnegie Mellon University, RI-NSH 4103, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA e-mail: mostow@cs.cmu.edu
YI-TING HUANG
Affiliation:
Information Management, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, 10617 Taipei, Taiwan e-mail: d97008@im.ntu.edu.tw
HYEJU JANG
Affiliation:
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA e-mail: hyejuj@cs.cmu.edu
ANDERS WEINSTEIN
Affiliation:
School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA e-mail: andersw@cs.cmu.edu
JOE VALERI
Affiliation:
e-mail: joevaleri@gmail.com
DONNA GATES
Affiliation:
e-mail: donnamgates7123@gmail.com

Abstract

We describe the development, pilot-testing, refinement, and four evaluations of Diagnostic Question Generator (DQGen), which automatically generates multiple choice cloze (fill-in-the-blank) questions to test children's comprehension while reading a given text. Unlike previous methods, DQGen tests comprehension not only of an individual sentence but also of the context preceding it. To test different aspects of comprehension, DQGen generates three types of distractors: ungrammatical distractors test syntax; nonsensical distractors test semantics; and locally plausible distractors test inter-sentential processing.
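To make the three distractor types concrete, the minimal Python sketch below shows one way such a question might be represented; the class names, fields, and example sentences are our own illustrative assumptions, not DQGen's internal representation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Choice:
    text: str
    intended_type: str  # 'correct', 'plausible', 'nonsensical', or 'ungrammatical'

@dataclass
class ClozeQuestion:
    context: str           # sentences preceding the cloze sentence
    stem: str              # cloze sentence with the deleted word replaced by a blank
    choices: List[Choice]  # the correct answer plus one distractor of each type

# Hypothetical example: each distractor targets a different aspect of comprehension.
question = ClozeQuestion(
    context="The storm knocked out the electricity all over town.",
    stem="That night the family read by the light of a ____.",
    choices=[
        Choice("candle", "correct"),
        Choice("lamp", "plausible"),         # fits this sentence alone, but the power is out
        Choice("sandwich", "nonsensical"),   # grammatical noun, yet makes no sense here
        Choice("quickly", "ungrammatical"),  # wrong part of speech for the blank
    ],
)
```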

  (1) A pilot study of DQGen 2012 evaluated its overall questions and individual distractors, guiding its refinement into DQGen 2014.

  (2) Twenty-four elementary students generated 200 responses to multiple choice cloze questions that DQGen 2014 generated from forty-eight stories. In 130 of the responses, the child chose the correct answer. We define the distractiveness of a distractor as the frequency with which students choose it over the correct answer. The incorrect responses were consistent with the expected distractiveness: twenty-seven were plausible, twenty-two were nonsensical, fourteen were ungrammatical, and seven were null (see the sketch after this list).

  (3) To compare DQGen 2014 against DQGen 2012, five human judges categorized candidate choices without knowing their intended type or whether each was the correct answer or a distractor generated by DQGen 2012 or DQGen 2014. The percentage of distractors categorized as their intended type was significantly higher for DQGen 2014.

  (4) We evaluated DQGen 2014 against human performance based on 1,486 similarly blind categorizations by twenty-seven judges of sixteen correct answers, forty-eight distractors generated by DQGen 2014, and 504 distractors authored by twenty-one humans. Surprisingly, DQGen 2014 did significantly better than humans at generating ungrammatical distractors and marginally better at generating nonsensical distractors, albeit slightly worse at generating plausible distractors. Moreover, vetting DQGen 2014's output and writing distractors only when necessary would halve the time needed to write them all and produce higher-quality distractors.
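To make the distractiveness measure in evaluation (2) concrete, here is a minimal Python sketch that computes, from the response counts reported there, how often each distractor type was chosen; treating this aggregate selection rate as a per-type proxy for distractiveness is our own simplifying assumption.

```python
# Response counts from evaluation (2): 200 responses in total, 130 correct,
# with the incorrect choices broken down by intended distractor type.
responses = {
    "correct": 130,
    "plausible": 27,
    "nonsensical": 22,
    "ungrammatical": 14,
    "null": 7,  # no choice selected
}

total = sum(responses.values())  # 200

# A simple proxy for distractiveness: the fraction of all responses in which
# students chose each distractor type instead of the correct answer.
for kind in ("plausible", "nonsensical", "ungrammatical"):
    print(f"{kind:>13}: {responses[kind] / total:.1%}")

# The resulting ordering (plausible > nonsensical > ungrammatical) matches the
# intended relative difficulty of the three distractor types.
```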

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 


Footnotes

*

This paper combines material from Mostow and Jang (2012), our AIED2015 paper (Huang and Mostow 2015) on a comparison to human performance, and substantial new content including improvements to DQGen and the evaluations reported in Sections 4.1 and 4.2. The research reported here was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080157, the National Science Foundation through Grant IIS1124240, and by the Taiwan National Science Council through the Graduate Students Study Abroad Program. We thank the other LISTENers who contributed to this work; everyone who categorized and wrote distractors; the reviewers of our BEA2012 and AIED2015 papers and this article for their helpful comments; and Prof. Y. S. Sun at National Taiwan University and Dr. M. C. Chen at Academia Sinica for enabling the first author to participate in this program. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute, the U.S. Department of Education, the National Science Foundation, or the National Science Council.

References

Agarwal, M., and Mannem, P. 2011a. Automatic gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 56–64. Stroudsburg, PA: Association for Computational Linguistics.
Agarwal, M., Shah, R., and Mannem, P. 2011b. Automatic question generation using discourse cues. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–9. Stroudsburg, PA: Association for Computational Linguistics.
Aldabe, I., and Maritxalar, M. 2010. Automatic distractor generation for domain specific texts. In Loftsson, H., Rögnvaldsson, E., and Helgadóttir, S. (eds.), Advances in Natural Language Processing: The 7th International Conference on NLP, Reykjavík, Iceland, pp. 27–38. Berlin/Heidelberg: Springer.
Aldabe, I., Maritxalar, M., and Martinez, E. 2007. Evaluating and improving distractor-generating heuristics. In Ezeiza, N., Maritxalar, M., and S. M. (eds.), The Workshop on NLP for Educational Resources, in conjunction with RANLP07, Amsterdam, Netherlands, pp. 7–13. Borovets, Bulgaria.
Aldabe, I., Maritxalar, M., and Mitkov, R. 2009, July 6–10. A study on the automatic selection of candidate sentences and distractors. In Dimitrova, V., Mizoguchi, R., Boulay, B. D., and Graesser, A. (eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED2009), pp. 656–8. Brighton, UK: IOS Press.
Becker, L., Basu, S., and Vanderwende, L. 2012. Mind the gap: learning to choose gaps for question generation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 742–51. Montreal, Canada: Association for Computational Linguistics.
Biemiller, A. 2009. Words Worth Teaching: Closing the Vocabulary Gap. Columbus, OH: SRA/McGraw-Hill.
Brown, J. C., Frishkoff, G. A., and Eskenazi, M. 2005, October 6–8. Automatic question generation for vocabulary assessment. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 819–26. Vancouver, BC, Canada. Stroudsburg, PA: Association for Computational Linguistics.
Burton, S. J., Sudweeks, R. R., Merrill, P. F., and Wood, B. 1991. How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty. Salt Lake City, UT: Brigham Young University Testing Services and The Department of Instructional Science.
Cassels, J. R. T., and Johnstone, A. H. 1984. The effect of language on student performance on multiple choice tests in chemistry. Journal of Chemical Education 61 (7): 613.
Chang, K.-M., Nelson, J., Pant, U., and Mostow, J. 2013. Toward exploiting EEG input in a reading tutor. International Journal of Artificial Intelligence in Education 22 (1, "Best of AIED2011 Part 1"): 29–41.
Chen, W., Mostow, J., and Aist, G. S. 2013. Recognizing young readers' spoken questions. International Journal of Artificial Intelligence in Education 21 (4): 255–69.
Coniam, D. 1997. A preliminary inquiry into using corpus word frequency data in the automatic generation of English language cloze tests. CALICO Journal 14 (2–4): 15–33.
Correia, R., Baptista, J., Mamede, N., Trancoso, I., and Eskenazi, M. 2010, September 22–24. Automatic generation of cloze question distractors. In Proceedings of the Interspeech 2010 Satellite Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, Waseda University, Tokyo, Japan.
Fellbaum, C. 2012. WordNet. In The Encyclopedia of Applied Linguistics. Hoboken, NJ: Blackwell Publishing Ltd.
Gates, D., Aist, G., Mostow, J., Mckeown, M., and Bey, J. 2011. How to generate cloze questions from definitions: a syntactic approach. In Proceedings of the AAAI Symposium on Question Generation, pp. 19–22. Arlington, VA: AAAI Press.
Goto, T., Kojiri, T., Watanabe, T., Iwata, T., and Yamada, T. 2010. Automatic generation system of multiple-choice cloze questions and its evaluation. Knowledge Management & E-Learning: An International Journal (KM&EL) 2 (3): 210–24.
Graesser, A. C., and Bertus, E. L. 1998. The construction of causal inferences while reading expository texts on science and technology. Scientific Studies of Reading 2 (3): 247–69.
Haladyna, T. M., Downing, S. M., and Rodriguez, M. C. 2002. A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education 15 (3): 309–34.
Heilman, M., and Smith, N. A. 2009. Question Generation via Overgenerating Transformations and Ranking (Technical Report CMU-LTI-09-013). Pittsburgh, PA: Carnegie Mellon University.
Heilman, M., and Smith, N. A. 2010, June. Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pp. 609–17. Los Angeles, CA: Association for Computational Linguistics.
Hensler, B. S., and Beck, J. E. 2006, June 26–30. Better student assessing by finding difficulty factors in a fully automated comprehension measure [best paper nominee]. In Ashley, K. and Ikeda, M. (eds.), Proceedings of the 8th International Conference on Intelligent Tutoring Systems, pp. 21–30. Jhongli, Taiwan: Springer-Verlag.
Huang, Y.-T., Chen, M. C., and Sun, Y. S. 2012, November 26–30. Personalized automatic quiz generation based on proficiency level estimation. In Proceedings of the 20th International Conference on Computers in Education (ICCE 2012), pp. 553–60. Singapore.
Huang, Y.-T., and Mostow, J. 2015, June 22–26. Evaluating human and automated generation of distractors for diagnostic multiple-choice cloze questions to assess children's reading comprehension. In Conati, C., Heffernan, N., Mitrovic, A., and Verdejo, M. F. (eds.), Proceedings of the 17th International Conference on Artificial Intelligence in Education, Madrid, Spain, pp. 155–64. Lecture Notes in Computer Science, vol. 9112. Switzerland: Springer International Publishing.
Kendall, M. G., and Babington Smith, B. 1939. The problem of m rankings. The Annals of Mathematical Statistics 10 (3): 275–87.
Kintsch, W. 2005. An overview of top-down and bottom-up effects in comprehension: the CI perspective. Discourse Processes 39 (2–3): 125–8.
Klein, D., and Manning, C. D. 2003, July 7–12. Accurate unlexicalized parsing. In E. W. Hinrichs and D. Roth (eds.), Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 423–30. Sapporo, Japan: Association for Computational Linguistics.
Kolb, P. 2008. DISCO: a multilingual database of distributionally similar words. In Proceedings of KONVENS-2008 (Konferenz zur Verarbeitung natürlicher Sprache), pp. 5–12. Berlin.
Kolb, P. 2009. Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA'09), Odense, Denmark.
Landis, J. R., and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics 33 (1): 159–74.
Lee, J., and Seneff, S. 2007, August 27–31. Automatic generation of cloze items for prepositions. In Proceedings of INTERSPEECH, pp. 2173–6. Antwerp, Belgium.
Li, L., Roth, B., and Sporleder, C. 2010. Topic models for word sense disambiguation and token-based idiom detection. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1138–47. Uppsala, Sweden: Association for Computational Linguistics.
Li, L., and Sporleder, C. 2009. Classifier combination for contextual idiom detection without labelled data. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 315–23. Singapore: Association for Computational Linguistics.
Lin, Y.-C., Sung, L.-C., and Chen, M. C. 2007. An automatic multiple-choice question generation scheme for English adjective understanding. In Workshop on Modeling, Management and Generation of Problems/Questions in eLearning, the 15th International Conference on Computers in Education (ICCE 2007), Amsterdam, Netherlands, pp. 137–42.
Liu, C.-L., Wang, C.-H., Gao, Z.-M., and Huang, S.-M. 2005, June 29. Applications of lexical information for algorithmically composing multiple-choice cloze items. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, pp. 1–8. Ann Arbor, Michigan: Association for Computational Linguistics.
Ming, L., Calvo, R. A., Aditomo, A., and Pizzato, L. A. 2012. Using Wikipedia and conceptual graph structures to generate questions for academic writing support. IEEE Transactions on Learning Technologies 5 (3): 251–63.
Mitkov, R., Ha, L. A., and Karamanis, N. 2006. A computer-aided environment for generating multiple choice test items. Natural Language Engineering 12 (2): 177–94.
Mitkov, R., Ha, L. A., Varga, A., and Rello, L. 2009, March 31. Semantic similarity of distractors in multiple-choice tests: extrinsic evaluation. In Basili, R. and Pennacchiotti, M. (eds.), EACL 2009 Workshop on GEMS: GEometrical Models of Natural Language Semantics, pp. 49–56. Athens, Greece: Association for Computational Linguistics.
Mostow, J. 2013, July. Lessons from Project LISTEN: what have we learned from a reading tutor that listens? (Keynote). In H. C. Lane, K. Yacef, J. Mostow, and P. Pavlik (eds.), Proceedings of the 16th International Conference on Artificial Intelligence in Education, pp. 557–8. Memphis, TN, LNAI, vol. 7926. Springer.
Mostow, J., Beck, J. E., Bey, J., Cuneo, A., Sison, J., Tobin, B., and Valeri, J. 2004. Using automated questions to assess reading comprehension, vocabulary, and effects of tutorial interventions. Technology, Instruction, Cognition and Learning 2 (1–2): 97–134.
Mostow, J., and Chen, W. 2009, July 6–10. Generating instruction automatically for the reading strategy of self-questioning. In Dimitrova, V., Mizoguchi, R., Boulay, B. D., and Graesser, A. (eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education, pp. 465–72. Brighton, UK: IOS Press.
Mostow, J., and Jang, H. 2012, June 7. Generating diagnostic multiple choice comprehension cloze questions. In NAACL-HLT 2012 7th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 136–46. Montréal: Association for Computational Linguistics.
Niraula, N. B., Rus, V., Stefanescu, D., and Graesser, A. C. 2014. Mining gap-fill questions from tutorial dialogues. In Proceedings of the 7th International Conference on Educational Data Mining, pp. 265–8. London, UK.
Pearson, P. D., and Hamm, D. N. 2005. The history of reading comprehension assessment. In Paris, S. G. and Stahl, S. A. (eds.), Children's Reading Comprehension and Assessment, pp. 13–69. London, United Kingdom: CIERA.
Pino, J., Heilman, M., and Eskenazi, M. 2008. A selection strategy to improve cloze question quality. In Proceedings of the Workshop on Intelligent Tutoring Systems for Ill-Defined Domains, 9th International Conference on Intelligent Tutoring Systems, pp. 22–34. Montreal, Canada.
Piwek, P., and Boyer, K. E. 2012. Varieties of question generation: introduction to this special issue. Dialogue and Discourse 3 (2): 1–9.
Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D., and Manning, C. 2010. A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 492–501. MIT, Cambridge, MA: Association for Computational Linguistics.
Rus, V., Wyse, B., Piwek, P., Lintean, M., Stoyanchev, S., and Moldovan, C. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference, pp. 251–7. Dublin, Ireland: Association for Computational Linguistics.
Shrout, P. E., and Fleiss, J. L. 1979. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 86 (2): 420–8.
Sleator, D. D. K., and Temperley, D. 1993, August 10–13. Parsing English with a link grammar. In Third International Workshop on Parsing Technologies, Tilburg, NL, and Durbuy, Belgium.
Smith, S., Sommers, S., and Kilgarriff, A. 2008. Learning words right with the Sketch Engine and WebBootCat: automatic cloze generation from corpora and the web. In Proceedings of the 25th International Conference of English Teaching and Learning & 2008 International Conference on English Instruction and Assessment, pp. 1–8. Lisbon, Portugal.
Sumita, E., Sugaya, F., and Yamamoto, S. 2005. Measuring non-native speakers' proficiency of English by using a test with automatically-generated fill-in-the-blank questions. In Proceedings of the Second Workshop on Building Educational Applications Using NLP, pp. 61–8. Ann Arbor, Michigan: Association for Computational Linguistics.
Tapanainen, P., and Järvinen, T. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pp. 64–71. Washington, DC: Association for Computational Linguistics.
Toutanova, K., Klein, D., Manning, C., and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Human Language Technology Conference and Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Canada, pp. 252–9.
Unspecified. 2006. Tiny invaders. National Geographic Explorer (Pioneer Edition). http://ngexplorer.cengage.com/pioneer/
van den Broek, P., Everson, M., Virtue, S., Sung, Y., and Tzeng, Y. 2002. Comprehension and memory of science texts: inferential processes and the construction of a mental representation. In Otero, J., Leon, J., and Graesser, A. C. (eds.), The Psychology of Science Text Comprehension, pp. 131–54. Mahwah, NJ: Erlbaum.
Zesch, T., and Melamud, O. 2014. Automatic generation of challenging distractors using context-sensitive inference rules. In Workshop on Innovative Use of NLP for Building Educational Applications (BEA), pp. 143–8. Baltimore, MD.
Zhang, X., Mostow, J., and Beck, J. E. 2007, July 9–13. Can a computer listen for fluctuations in reading comprehension? In R. Luckin, K. R. Koedinger, and J. Greer (eds.), Proceedings of the 13th International Conference on Artificial Intelligence in Education, pp. 495–502. Marina del Rey, CA: IOS Press.