
Towards the benchmarking of question generation: introducing the Monserrate corpus

  • Original Paper
Language Resources and Evaluation

Abstract

Despite the growing interest in Question Generation, evaluating these systems remains notably difficult. Many authors rely on metrics like BLEU or ROUGE instead of manual evaluation, as their computation is mostly free. However, the corpora generally used as references are very incomplete, containing just a couple of hypotheses per source sentence. In this paper, we propose the Monserrate corpus, a dataset specifically built to evaluate Question Generation systems, with, on average, 26 questions associated with each source sentence, attempting to be an “exhaustive” reference. With Monserrate we study the impact of reference size on the evaluation of Question Generation systems. Several evaluation metrics are used, from more traditional lexical ones to metrics based on word embeddings, and we conclude that these metrics remain a limiting factor in evaluation, as they lead to different outcomes. Finally, with Monserrate, we benchmark three Question Generation systems, representing different approaches to this task.
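To make the role of reference size concrete, the sketch below scores one generated question against a growing set of reference questions using multi-reference sentence-level BLEU from NLTK. This is only an illustration of the point made above, not the paper's evaluation pipeline (the paper's notes point to the nlg-eval toolkit), and the example questions are invented rather than taken from Monserrate.

```python
# Toy illustration (not the paper's setup): multi-reference BLEU with NLTK.
# The hypothesis and reference questions below are made up for the example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "who built the monserrate palace ?".split()

references = [
    "who had the monserrate palace built ?".split(),
    "by whom was the monserrate palace built ?".split(),
    "who was responsible for building the palace ?".split(),
]

smooth = SmoothingFunction().method1

# Score against an increasing number of references: with more references,
# n-gram matches become more likely, so the score can only stay the same
# or increase, rewarding valid paraphrases that a single reference misses.
for k in range(1, len(references) + 1):
    score = sentence_bleu(references[:k], hypothesis, smoothing_function=smooth)
    print(f"references used: {k}  BLEU: {score:.3f}")
```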


Notes

  1. https://github.com/bjwyse/QGSTEC2010.

  2. https://github.com/hprodrig/MONSERRATE_Corpus.

  3. https://en.wikipedia.org/wiki/Monserrate_Palace.

  4. https://github.com/hprodrig/MONSERRATE_Corpus.

  5. https://github.com/Maluuba/nlg-eval.

  6. Results in the last section include all configurations together.


Acknowledgements

Hugo Rodrigues was supported by the Carnegie Mellon-Portugal program (SFRH/BD/51916/2012). This work was also supported by national funds through Fundação para a Ciência e Tecnologia (FCT) with reference UIDB/50021/2020.

Author information


Corresponding author

Correspondence to Hugo Rodrigues.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Rodrigues, H., Nyberg, E. & Coheur, L. Towards the benchmarking of question generation: introducing the Monserrate corpus. Lang Resources & Evaluation 56, 573–591 (2022). https://doi.org/10.1007/s10579-021-09545-5
