A task-performance evaluation of referring expressions in situated collaborative task dialogues

  • Original Paper
Language Resources and Evaluation

Abstract

Appropriate evaluation of referring expressions is critical for the design of systems that can collaborate effectively with humans. A widely used method is to evaluate the degree to which an algorithm can reproduce the expressions found in previously collected corpora. Several researchers, however, have noted the need for a task-performance evaluation that measures how effectively a referring expression contributes to achieving a given task goal. This is particularly important in collaborative situated dialogues. Using referring expressions produced by six pairs of Japanese speakers collaboratively solving Tangram puzzles, we conducted a task-performance evaluation of referring expressions with 36 human evaluators. In particular, we focused on the evaluation of demonstrative pronouns generated by a machine learning-based algorithm. Comparing the results of this task-performance evaluation with those of a previously conducted corpus-matching evaluation (Spanger et al. in Lang Resour Eval, 2010b), we confirm the limitations of corpus-matching evaluation and discuss the need for task-performance evaluation.

Notes

  1. They compared the system output with subjective human judgement as a gold standard rather than an existing corpus.

  2. In this methodology, subjects are asked to perform several kinds of tasks simultaneously, one of which is the target task for measuring cognitive load.

  3. The corpus is publicly available together with other variants as the REX corpora (Tokunaga et al. 2012) through GSK (http://www.gsk.or.jp/index_e.html) (Resource ID: GSK2013-A).

  4. There are three types of demonstrative pronouns/adjectives in Japanese: “kore/kono (this)”, “sore/sono (that)” and “are/ano (that)”. The choice among them is based mainly on the physical and mental distance between the speaker and the target (Ono 1994).

  5. http://svmlight.joachims.org/.

  6. In our current setting, “sore/sono (that)” is the most appropriate because the solver (speaker) does not have control of the mouse, i.e. the pointing device. Indeed, “sore (that)” is the most frequent demonstrative pronoun in the entire corpus.

References

  • Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G., Garrod, S., et al. (1991). The HCRC map task corpus. Language and Speech, 34(4), 351–366.

  • Belz, A., & Gatt, A. (2008). Intrinsic vs. extrinsic evaluation measures for referring expression generation. In Proceedings of ACL-08: HLT, Short Papers (pp. 197–200).

  • Belz, A., & Kow, E. (2010). The GREC challenges 2010: Overview and evaluation results. In Proceedings of the 6th international natural language generation conference (pp. 219–229).

  • Belz, A., Kow, E., Viethen, J., & Gatt, A. (2010). Referring expression generation in context: The GREC shared task evaluation challenges. In E. Krahmer, & M. Theune (Eds.), Empirical methods in natural language generation (Vol. LNCS 5790, pp. 294–327). Berlin: Springer.

  • Bolt, R. A. (1980). Put-that-there: Voice and gesture at the graphics interface. In Proceedings of the 7th annual conference on computer graphics and interactive techniques (SIGGRAPH 1980) (pp. 262–270). ACM.

  • Byron, D., Koller, A., Striegnitz, K., Cassell, J., Dale, R., Moore, J., et al. (2009). Report on the first NLG challenge on generating instructions in virtual environments (GIVE). In Proceedings of the 12th European workshop on natural language generation (ENLG 2009) (pp. 165–173).

  • Cahill, A., & van Genabith, J. (2006). Robust PCFG-based generation using automatically acquired LFG approximations. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (pp. 1033–1040).

  • Campana, E., Tanenhaus, M. K., Allen, J. F., & Remington, R. (2011). Natural discourse reference generation reduces cognitive load in spoken systems. Natural Language Engineering, 17(3), 311–329.

  • Carenini, G., & Moore, J. D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11), 925–952.

  • Clark, H. H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 22, 1–39.

  • Dale, R. (1989). Cooking up referring expressions. In Proceedings of the 27th annual meeting of the association for computational linguistics (pp. 68–75).

  • Dale, R., & Reiter, E. (1995). Computational interpretation of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2), 233–263.

  • Di Eugenio, B., Glass, M., & Trolio, M. J. (2002). The DIAG experiments: Natural language generation for intelligent tutoring systems. In Proceedings of the 2nd international natural language generation conference (INLG 2002) (pp. 120–127).

  • Di Eugenio, B., Jordan, P. W., Thomason, R. H., & Moore, J. D. (2000). The agreement process: An empirical investigation of human-human computer-mediated collaborative dialogues. International Journal of Human-Computer Studies, 53(6), 1017–1076.

  • Foster, M. E., Giuliani, M., & Knoll, A. (2009). Comparing objective and subjective measures of usability in a human-robot dialogue system. In Proceedings of the 47th annual meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 879–887).

  • Gargett, A., Garoufi, K., Koller, A., & Striegnitz, K. (2010). The GIVE-2 corpus of giving instructions in virtual environments. In Proceedings of the seventh conference on international language resources and evaluation (LREC 2010) (pp. 2401–2406).

  • Gatt, A., & Belz, A. (2010). Introducing shared tasks to NLG: The TUNA shared task evaluation challenges. In E. Krahmer, & M. Theune (Eds.), Empirical methods in natural language generation (Vol. LNAI 5790, pp. 264–293). Berlin: Springer.

  • Gupta, S., & Stent, A. J. (2005). Automatic evaluation of referring expression generation using corpora. In Proceedings of the 1st workshop on using corpora in NLG.

  • Heeman, P. A., & Hirst, G. (1995). Collaborating on referring expressions. Computational Linguistics, 21(3), 351–382.

  • Horton, W. S., & Keysar, B. (1996). When do speakers take into account common ground? Cognition, 59, 91–117.

  • Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge: MIT Press.

  • Jordan, P. W., & Walker, M. A. (2005). Learning content selection rules for generating object descriptions in dialogue. Journal of Artificial Intelligence Research, 24, 157–194.

  • Khan, I., van Deemter, K., Ritchie, G., Gatt, A., & Cleland, A. A. (2009). A hearer-oriented evaluation of referring expression generation. In Proceedings of the 12th European workshop on natural language generation (ENLG 2009) (pp. 98–101).

  • Koller, A., Striegnitz, K., Gargett, A., Byron, D., Cassell, J., Dale, R., et al. (2010). Report on the second NLG challenge on generating instructions in virtual environments (GIVE-2). In Proceedings of the 6th international natural language generation conference (pp. 243–250).

  • Krahmer, E., & van Deemter, K. (2012). Computational generation of referring expressions: A survey. Computational Linguistics, 38(1), 173–218.

  • Lester, J. C., Voerman, J. L., Towns, S. G., & Callaway, C. B. (1999). Deictic believability: Coordinating gesture, locomotion, and speech in lifelike pedagogical agents. Applied Artificial Intelligence, 13(4–5), 383–414.

  • Mitkov, R. (2002). Anaphora resolution. London: Longman.

  • Ono, K. (1994). Territories of information and Japanese demonstratives. The Journal of the Association of Teachers of Japanese, 28(2), 131–155.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (ACL 2002) (pp. 311–318).

  • Paraboni, I., van Deemter, K., & Masthoff, J. (2007). Generating referring expressions: Making referents easy to identify. Computational Linguistics, 33(2), 229–254.

  • Reiter, E., & Belz, A. (2009). An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4), 529–558.

  • Reiter, E., Robertson, R., & Osman, L. M. (2003). Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144(1–2), 41–58.

  • Reiter, E., & Sripada, S. (2002). Should corpora texts be gold standards for NLG? In Proceedings of the 2nd international natural language generation conference (INLG 2002) (pp. 97–104).

  • Reiter, E., Sripada, S., Hunter, J., Yu, J., & Davy, I. (2005). Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1–2), 137–169.

  • Spanger, P., Iida, R., Tokunaga, T., Terai, A., & Kuriyama, N. (2010a). Towards an extrinsic evaluation of referring expressions in situated dialogs. In J. Kelleher, B. Mac Namee, & I. van der Sluis (Eds.), Proceedings of the sixth international natural language generation conference (INLG 2010) (pp. 135–144).

  • Spanger, P., Yasuhara, M., Iida, R., & Tokunaga, T. (2009). Using extra linguistic information for generating demonstrative pronouns in a situated collaboration task. In Proceedings of PreCogSci 2009: Production of referring expressions: Bridging the gap between computational and empirical approaches to reference.

  • Spanger, P., Yasuhara, M., Iida, R., Tokunaga, T., Terai, A., & Kuriyama, N. (2010b). REX-J: Japanese referring expression corpus of situated dialogs. Language Resources and Evaluation, 46(3), 461–491.

  • Sparck Jones, K., & Galliers, J. R. (1996). Evaluating natural language processing systems: An analysis and review. Berlin: Springer.

  • Stoia, L., Shockley, D. M., Byron, D. K., & Fosler-Lussier, E. (2006). Noun phrase generation for situated dialogs. In Proceedings of the 4th international natural language generation conference (INLG 2006) (pp. 81–88).

  • Striegnitz, K., Denis, A., Gargett, A., Garoufi, K., Koller, A., & Theune, M. (2011). Report on the second second challenge on generating instructions in virtual environments (GIVE-2.5). In Proceedings of the 13th European workshop on natural language generation (ENLG 2011) (pp. 270–297).

  • Tokunaga, T., Iida, R., Terai, A., & Kuriyama, N. (2012). The REX corpora: A collection of multimodal corpora of referring expressions in collaborative problem solving dialogues. In Proceedings of the eighth international conference on language resources and evaluation (LREC 2012) (pp. 422–429).

  • van Deemter, K., Gatt, A., van der Sluis, I., & Power, R. (2012). Generation of referring expressions: Assessing the incremental algorithm. Cognitive Science, 36(5), 799–836.

  • van der Sluis, I., Gatt, A., & van Deemter, K. (2007). Evaluating algorithms for the generation of referring expressions: Going beyond toy domains. In Proceedings of recent advances in natural language processing (RANLP 2007).

  • van der Sluis, I., & Krahmer, E. (2007). Generating multimodal references. Discourse Processes, 44(3), 145–174.

  • Vapnik, V. N. (1998). Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. New York: Wiley.

  • Young, R. M. (1999). Using Grice’s maxim of quantity to select the content of plan descriptions. Artificial Intelligence, 115, 215–256.

Author information

Corresponding author

Correspondence to Takenobu Tokunaga.

About this article

Cite this article

Spanger, P., Iida, R., Tokunaga, T. et al. A task-performance evaluation of referring expressions in situated collaborative task dialogues. Lang Resources & Evaluation 47, 1285–1304 (2013). https://doi.org/10.1007/s10579-013-9240-5

  • DOI: https://doi.org/10.1007/s10579-013-9240-5
