
The joint student response analysis and recognizing textual entailment challenge: making sense of student responses in educational applications


Abstract

We present the results of the joint student response analysis (SRA) and 8th recognizing textual entailment challenge, whose goal was to bring together researchers from the educational natural language processing and computational semantics communities. The SRA task is to assess student responses to questions in the science domain, focusing on the correctness and completeness of the response content. Nine teams took part in the challenge, submitting a total of 18 runs using methods and features adapted from previous research on automated short answer grading, recognizing textual entailment and semantic textual similarity. We provide an extended analysis of the results, focusing on the impact of evaluation metrics, application scenarios, and the methods and features used by the participants. We conclude that additional research is required before syntactic dependency features and external semantic resources can be leveraged effectively for this task, possibly due to the limited coverage of scientific domains in existing resources. However, all three approaches that adjusted features and models to the application scenario improved system performance, meriting further investigation by the research community.


Notes

  1. There was also a partial entailment task, which is outside the scope of this paper—see Dzikovska et al. (2013b) for details.

  2. It is easy to imagine a numeric grading scheme that converts such categorical labels into numeric scores, making the SRA labels equally useful for supporting summative assessment (a minimal sketch of one such mapping follows these notes).

  3. In the 5-way task, we excluded the “non-domain” class from the calculation of the macro-average on the SciEntsBank dataset because it was severely underrepresented, with only 23 out of 4335 total examples, and hence could have had a significant random effect (a sketch of this calculation also follows these notes).

  4. We decided to treat it as a semantic resource feature for purposes of our analysis, since it relies on an external corpus, though its exact use is not very clear in the prior literature.

  5. http://bit.ly/11a7QpP.

  6. Recall that there was no Unseen Domains test set in the Beetle data.

  7. For purposes of this analysis, we focused on the use of dependency features, and did not take into account the use of part of speech tags.
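
The following minimal Python sketch illustrates the kind of numeric grading scheme mentioned in note 2. The label names follow the 5-way SRA annotation scheme used in the challenge, but the exact string forms and the specific score values are illustrative assumptions, not part of the challenge definition.

```python
# Hypothetical mapping from the 5-way SRA labels to numeric scores.
# The score values are assumptions chosen only for illustration.
LABEL_TO_SCORE = {
    "correct": 1.0,
    "partially_correct_incomplete": 0.5,
    "contradictory": 0.0,
    "irrelevant": 0.0,
    "non_domain": 0.0,
}


def grade(labels):
    """Convert categorical SRA labels into numeric scores for summative assessment."""
    return [LABEL_TO_SCORE[label] for label in labels]


# Example: grade(["correct", "partially_correct_incomplete", "irrelevant"])
# returns [1.0, 0.5, 0.0].
```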
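
As a companion to note 3, here is a minimal sketch (assuming scikit-learn is available) of a macro-averaged F1 computation that excludes the underrepresented “non-domain” class from the average. The label strings are placeholders matching the 5-way annotation scheme; this is not the official challenge scorer.

```python
# Macro-averaged F1 that ignores the rare "non_domain" class, as described
# in note 3. Assumes scikit-learn; label strings are illustrative.
from sklearn.metrics import f1_score

FIVE_WAY_LABELS = [
    "correct",
    "partially_correct_incomplete",
    "contradictory",
    "irrelevant",
    "non_domain",
]

# Average only over the four well-represented classes.
SCORED_LABELS = [label for label in FIVE_WAY_LABELS if label != "non_domain"]


def macro_f1_without_non_domain(gold, predicted):
    """Macro-F1 over the 5-way labels, excluding non_domain from the average."""
    return f1_score(gold, predicted, labels=SCORED_LABELS, average="macro")


# Example usage with toy data:
# macro_f1_without_non_domain(
#     ["correct", "irrelevant", "contradictory"],
#     ["correct", "contradictory", "contradictory"],
# )
```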

References

  • Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The first joint conference on lexical and computational semantics (pp. 385–393). Montréal: Association for Computational Linguistics.

  • Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., & Guo, W. (2013). *SEM 2013 shared task: Semantic textual similarity. In Second joint conference on lexical and computational semantics (*SEM) (pp. 32–43). Atlanta, GA: Association for Computational Linguistics.

  • Aldabe, I., Maritxalar, M., & Lopez de Lacalle, O. (2013). EHU-ALM: Similarity-feature based approach for student response analysis. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 580–584). Atlanta, GA: Association for Computational Linguistics.

  • Bentivogli, L., Clark, P., Dagan, I., Dang, H. T., & Giampiccolo, D. (2010). The sixth PASCAL recognizing textual entailment challenge. In Notebook papers and results, text analysis conference (TAC).

  • Bentivogli, L., Clark, P., Dagan, I., Dang, H. T., & Giampiccolo, D. (2011). The seventh PASCAL recognizing textual entailment challenge. In Notebook papers and results, text analysis conference (TAC).

  • Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., & Magnini, B. (2009). The fifth PASCAL recognizing textual entailment challenge. In Proceedings of text analysis conference (TAC) 2009.

  • Bicici, E., & van Genabith, J. (2013). CNGL: Grading student answers by acts of translation. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 585–591). Atlanta, GA: Association for Computational Linguistics.

  • Burrows, S., Gurevych, I., & Stein, B. (2015a). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60–117. doi:10.1007/s40593-014-0026-8.

  • Burstein, J., Tetreault, J., & Madnani, N. (2013). The e-rater essay scoring system. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions. London: Taylor and Francis.

  • Campbell, G. C., Steinhauser, N. B., Dzikovska, M. O., Moore, J. D., Callaway, C. B., & Farrow, E. (2009). The DeMAND coding scheme: A “common language” for representing and analyzing student discourse. In Proceedings of 14th international conference on artificial intelligence in education (AIED), poster session, Brighton.

  • Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1–27:27.

  • Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL recognising textual entailment challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, & F. d’Alché-Buc (Eds.), Machine learning challenges, lecture notes in computer science (Vol. 3944). Berlin: Springer.

  • Dale, R., Anisimoff, I., & Narroway, G. (2012). HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the seventh workshop of building educational applications using NLP. Association for Computational Linguistics.

  • Dale, R., & Kilgarriff, A. (2011). Helping our own: The HOO 2011 pilot shared task. In Proceedings of the generation challenges session at the 13th European workshop on natural language generation (pp. 242–249). Association for Computational Linguistics.

  • Daume, H., III (2007). Frustratingly easy domain adaptation. In Proceedings of the 45th annual meeting of the Association for Computational Linguistics (pp. 256–263). Prague: Association for Computational Linguistics.

  • Dzikovska, M. O., Bell, P., Isard, A., Moore, J. D. (2012a). Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system. In Proceedings of EACL-12 conference (pp. 471–481).

  • Dzikovska, M. O., Farrow, E., & Moore, J. D. (2013a). Combining semantic interpretation and statistical classification for improved explanation processing in a tutorial dialogue system. In Proceedings of the 16th international conference on artificial intelligence in education (AIED 2013), Memphis, TN.

  • Dzikovska, M. O., Moore, J. D., Steinhauser, N., Campbell, G., Farrow, E., & Callaway, C. B. (2010). Beetle II: A system for tutoring and computational linguistics experimentation. In Proceedings of ACL 2010 system demonstrations (pp. 13–18).

  • Dzikovska, M. O., Nielsen, R. D., & Brew, C. (2012b). Towards effective tutorial feedback for explanation questions: A dataset and baselines. In Proceedings of 2012 conference of NAACL: Human language technologies (pp. 200–210).

  • Dzikovska, M. O., Nielsen, R., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., et al. (2013b). SemEval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013). Atlanta, GA: Association for Computational Linguistics.

  • Dzikovska, M., Steinhauser, N., Farrow, E., Moore, J., & Campbell, G. (2014). BEETLE II: Deep natural language understanding and automatic feedback generation for intelligent tutoring in basic electricity and electronics. International Journal of Artificial Intelligence in Education, 24(3), 284–332. doi:10.1007/s40593-014-0026-8.

  • Giampiccolo, D., Dang, H. T., Magnini, B., Dagan, I., Cabrio, E., & Dolan, B. (2008). The fourth PASCAL recognizing textual entailment challenge. In Proceedings of text analysis conference (TAC) 2008, Gaithersburg, MD.

  • Glass, M. (2000). Processing language input in the CIRCSIM-Tutor intelligent tutoring system. In Papers from the 2000 AAAI fall symposium. AAAI technical report FS-00-01 (pp. 74–79).

  • Gleize, M., & Grau, B. (2013). LIMSIILES: Basic English substitution for student answer assessment at SemEval 2013. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 598–602). Atlanta, GA: Association for Computational Linguistics.

  • Graesser, A. C., Wiemer-Hastings, K., Wiemer-Hastings, P., & Kreuz, R. (1999). AutoTutor: A simulation of a human tutor. Cognitive Systems Research, 1, 35–51.

  • Heilman, M., & Madnani, N. (2012). ETS: Discriminative edit models for paraphrase scoring. In *SEM 2012: The first joint conference on lexical and computational semantics—Vol. 1: Proceedings of the main conference and the shared task, and Vol. 2: Proceedings of the sixth international workshop on semantic evaluation (SemEval 2012) (pp. 529–535). Montréal: Association for Computational Linguistics.

  • Heilman, M., & Madnani, N. (2013a). ETS: Domain adaptation and stacking for short answer scoring. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 275–279). Atlanta, GA: Association for Computational Linguistics.

  • Heilman, M., & Madnani, N. (2013b). HENRY-CORE: Domain adaptation and stacking for text similarity. In Second joint conference on lexical and computational semantics (*SEM), Vol. 1: Proceedings of the main conference and the shared task: semantic textual similarity (pp. 96–102). Atlanta, GA: Association for Computational Linguistics.

  • Jimenez, S., Becerra, C., & Gelbukh, A. (2013). SOFTCARDINALITY: Hierarchical text overlap for student response analysis. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 280–284). Atlanta, GA: Association for Computational Linguistics.

  • Jordan, P. W., Makatchev, M., & Pappuswamy, U. (2006a). Understanding complex natural language explanations in tutorial applications. In Proceedings of the third workshop on scalable natural language understanding, ScaNaLU ’06 (pp. 17–24).

  • Jordan, P., Makatchev, M., Pappuswamy, U., VanLehn, K., & Albacete, P. (2006b). A natural language tutorial dialogue system for physics. In Proceedings of the 19th international FLAIRS conference (pp. 521–527).

  • Kouylekov, M., Dini, L., Bosca, A., & Trevisan, M. (2013). Celi: EDITS and generic text pair classification. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 592–597). Atlanta, GA: Association for Computational Linguistics.

  • Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405.

  • Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. R. (2014). Automated grammatical error detection for language learners, Second edition. Synthesis lectures on human language technologies. San Rafael: Morgan & Claypool.

  • Levy, O., Zesch, T., Dagan, I., & Gurevych, I. (2013). UKP-BIU: Similarity and entailment metrics for student response analysis. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 285–289). Atlanta, GA: Association for Computational Linguistics.

  • MacDonald, N. H., Frase, L. T., Gingrich, P. S., & Keenan, S. A. (1982). The writer’s workbench: Computer aids for text analysis. IEEE Transactions on Communications, 30, 105–110.

  • McConville, M., & Dzikovska, M. O. (2008). Deep grammatical relations for semantic interpretation. In Coling 2008: Proceedings of the workshop on cross-framework and cross-domain parser evaluation (pp. 51–58).

  • Mohler, M., Bunescu, R., & Mihalcea, R. (2011). Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. Portland, OR: Association for Computational Linguistics (pp. 752–762). http://www.aclweb.org/anthology/P11-1076.

  • Ng, H. T., Wu, S. M., Briscoe, T., Hadiwinoto, C., Susanto, R. H., & Bryant, C. (2014). The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the 18th conference on computational natural language learning: Shared task (pp. 1–14). Association for Computational Linguistics.

  • Ng, H. T., Wu, S. M., Wu, Y., & Tetreault, J. (2013). The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the 17th conference on computational natural language learning. Association for Computational Linguistics.

  • Nielsen, R. D., Ward, W., & Martin, J. H. (2008a). Learning to assess low-level conceptual understanding. In Proceedings of 21st international FLAIRS conference (pp. 427–432).

  • Nielsen, R. D., Ward, W., Martin, J. H., & Palmer, M. (2008b). Annotating students’ understanding of science concepts. In Proceedings of the sixth international language resources and evaluation conference (LREC08), Marrakech.

  • Nielsen, R. D., Ward, W., & Martin, J. H. (2009). Recognizing entailment in intelligent tutoring systems. Natural Language Engineering, 15, 479–501.

  • Okoye, I., Bethard, S., & Sumner, T. (2013). CU: Computational assessment of short free text answers—A tool for evaluating students’ understanding. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 603–607). Atlanta, GA: Association for Computational Linguistics.

  • Ott, N., Ziai, R., Hahn, M., & Meurers, D. (2013). CoMeT: Integrating different levels of linguistic modeling for meaning assessment. In Second joint conference on lexical and computational semantics (*SEM), Vol. 2: Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 608–616). Atlanta, GA: Association for Computational Linguistics.

  • Page, E. (1996). The imminence of grading essays by computer. Bloomington: Phi Delta Kappa.

  • Pon-Barry, H., Clark, B., Schultz, K., Bratt, E. O., & Peters, S. (2004). Advantages of spoken language interaction in dialogue-based intelligent tutoring systems. In Proceedings of ITS-2004 conference (pp. 390–400).

  • Pulman, S. G., & Sukkarieh, J. Z. (2005). Automatic short answer marking. In Proceedings of the second workshop on building educational applications using NLP (pp. 9–16). Ann Arbor, MI: Association for Computational Linguistics.

  • Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. London: Routledge.

  • Tetreault, J., Blanchard, D., & Cahill, A. (2013). A report on the first native language identification shared task. In Proceedings of the eighth workshop on innovative use of NLP for building educational applications (pp. 48–57). Association for Computational Linguistics.

  • VanLehn, K., Jordan, P., & Litman, D. (2007). Developing pedagogically effective tutorial dialogue tactics: Experiments and a testbed. In Proceedings of SLaTE workshop on speech and language technology in education. Farmington, PA.

  • Wolska, M., & Kruijff-Korbayová, I. (2004). Analysis of mixed natural and symbolic language input in mathematical dialogs. In Proceedings of ACL-2004, Barcelona.

  • Yeh, A. (2000). More accurate tests for the statistical significance of result differences. In Proceedings of the 18th international conference on computational linguistics (COLING 2000) (pp. 947–953). Stroudsburg, PA: Association for Computational Linguistics.

Acknowledgments

The research reported here was supported by the US ONR Award N000141410733 and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A120808 to the University of North Texas. The opinions expressed are those of the authors and do not represent views of the Institute of Education Sciences or the U.S. Department of Education. The authors would like to thank Chris Brew for discussions and suggestions related to the organization of the paper. We thank the three anonymous reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Myroslava O. Dzikovska.

Cite this article

Dzikovska, M.O., Nielsen, R.D. & Leacock, C. The joint student response analysis and recognizing textual entailment challenge: making sense of student responses in educational applications. Lang Resources & Evaluation 50, 67–93 (2016). https://doi.org/10.1007/s10579-015-9313-8
