Introducing Shared Tasks to NLG: The TUNA Shared Task Evaluation Challenges

Gatt, Albert; Belz, Anja

doi:10.1007/978-3-642-15573-4_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5790)

Abstract

Shared Task Evaluation Challenges (STECs) have only recently begun in the field of NLG. The TUNA STECs, which focused on Referring Expression Generation (REG), have been part of this development since its inception. This chapter looks back on the experience of organising the three TUNA Challenges, which came to an end in 2009. While we discuss the role of the STECs in yielding a substantial body of research on the REG problem, which has opened new avenues for future research, our main focus is on the role of different evaluation methods in assessing the output quality of REG algorithms, and on the relationship between such methods.

Author information

Authors and Affiliations

Institute of Linguistics, Centre for Communication Technology, University of Malta, Malta
Albert Gatt
Communication and Cognition, Faculty of Arts, Tilburg University, Netherlands
Albert Gatt
NLTG, School of Computing, Mathematical and Information Sciences, University of Brighton, UK
Anja Belz

Editor information

Editors and Affiliations

Faculty of Humanities, Department of Communication and Information Sciences (DCI), Tilburg University, P.O.Box 90153, 5000 LE, Tilburg, The Netherlands
Emiel Krahmer
Human Media Interaction (HMI), Department of Electrical Engineering, Mathematics and Computer Science (EEMCS), University of Twente, P.O. Box 217, 7500 AE, Enschede, The Netherlands
Mariët Theune

About this chapter

Cite this chapter

Gatt, A., Belz, A. (2010). Introducing Shared Tasks to NLG: The TUNA Shared Task Evaluation Challenges. In: Krahmer, E., Theune, M. (eds) Empirical Methods in Natural Language Generation. EACL ENLG 2009. Lecture Notes in Computer Science, vol 5790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15573-4_14

DOI: https://doi.org/10.1007/978-3-642-15573-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15572-7
Online ISBN: 978-3-642-15573-4
eBook Packages: Computer Science, Computer Science (R0)
