
Towards the benchmarking of question generation: introducing the Monserrate corpus

  • Original Paper
Language Resources and Evaluation

Abstract

Despite the growing interest in Question Generation, evaluating these systems remains notably difficult. Many authors rely on metrics like BLEU or ROUGE instead of manual evaluation, as their computation is mostly free. However, the corpora generally used as references are very incomplete, containing just a couple of hypotheses per source sentence. In this paper, we propose the Monserrate corpus, a dataset specifically built to evaluate Question Generation systems, with, on average, 26 questions associated with each source sentence, attempting to be an “exhaustive” reference. With Monserrate we study the impact of reference size on the evaluation of Question Generation systems. Several evaluation metrics are used, from more traditional lexical ones to metrics based on word embeddings, and we conclude that these metrics remain a limiting factor in evaluation, as they lead to different outcomes. Finally, with Monserrate, we benchmark three Question Generation systems, representing different approaches to this task.
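To make the role of reference size concrete, the sketch below scores one generated question against a growing set of reference questions using multi-reference sentence-level BLEU from NLTK. This is only an illustration of the point made above, not the paper's evaluation pipeline (the paper's notes point to the nlg-eval toolkit), and the example questions are invented rather than taken from Monserrate.

```python
# Toy illustration (not the paper's setup): multi-reference BLEU with NLTK.
# The hypothesis and reference questions below are made up for the example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "who built the monserrate palace ?".split()

references = [
    "who had the monserrate palace built ?".split(),
    "by whom was the monserrate palace built ?".split(),
    "who was responsible for building the palace ?".split(),
]

smooth = SmoothingFunction().method1

# Score against an increasing number of references: with more references,
# n-gram matches become more likely, so the score can only stay the same
# or increase, rewarding valid paraphrases that a single reference misses.
for k in range(1, len(references) + 1):
    score = sentence_bleu(references[:k], hypothesis, smoothing_function=smooth)
    print(f"references used: {k}  BLEU: {score:.3f}")
```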


Notes

  1. https://github.com/bjwyse/QGSTEC2010.

  2. https://github.com/hprodrig/MONSERRATE_Corpus.

  3. https://en.wikipedia.org/wiki/Monserrate_Palace.

  4. https://github.com/hprodrig/MONSERRATE_Corpus.

  5. https://github.com/Maluuba/nlg-eval.

  6. Results in the last section include all configurations together.


Acknowledgements

Hugo Rodrigues was supported by the Carnegie Mellon-Portugal program (SFRH/BD/51916/2012). This work was also supported by national funds through Fundação para a Ciência e Tecnologia (FCT) with reference UIDB/50021/2020.

Author information


Corresponding author

Correspondence to Hugo Rodrigues.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Rodrigues, H., Nyberg, E. & Coheur, L. Towards the benchmarking of question generation: introducing the Monserrate corpus. Lang Resources & Evaluation 56, 573–591 (2022). https://doi.org/10.1007/s10579-021-09545-5
