A task-performance evaluation of referring expressions in situated collaborative task dialogues

Spanger, Philipp; Iida, Ryu; Tokunaga, Takenobu; Terai, Asuka; Kuriyama, Naoko

doi:10.1007/s10579-013-9240-5

Original Paper
Published: 21 June 2013

Volume 47, pages 1285–1304 (2013)

Language Resources and Evaluation

Philipp Spanger¹,
Ryu Iida¹,
Takenobu Tokunaga¹,
Asuka Terai² &
Naoko Kuriyama²

Abstract

Appropriate evaluation of referring expressions is critical for the design of systems that can effectively collaborate with humans. A widely used method is simply to measure the degree to which an algorithm can reproduce the expressions found in previously collected corpora. Several researchers, however, have noted the need for a task-performance evaluation that measures the effectiveness of a referring expression in achieving a given task goal. This is particularly important in collaborative situated dialogues. Using referring expressions produced by six pairs of Japanese speakers collaboratively solving Tangram puzzles, we conducted a task-performance evaluation of referring expressions with 36 human evaluators. In particular, we focused on the evaluation of demonstrative pronouns generated by a machine learning-based algorithm. Comparing the results of this task-performance evaluation with those of a previously conducted corpus-matching evaluation (Spanger et al. 2010b), we confirmed the limitations of corpus-matching evaluation and discuss the need for task-performance evaluation.

Notes

They compared the system output with subjective human judgement as a gold standard rather than an existing corpus.
In this methodology subjects are requested to perform different kinds of tasks simultaneously, one of which is the target task to measure the cognitive load.
The corpus is publicly available together with other variants as the REX corpora (Tokunaga et al. 2012) through GSK (http://www.gsk.or.jp/index_e.html) (Resource ID: GSK2013-A).
There are three types of demonstrative pronoun/adjective in Japanese: “kore/kono (this)”, “sore/sono (that)” and “are/ano (that)”. They are basically chosen based on the physical and mental distance between the speaker and the target (Ono 1994).
http://svmlight.joachims.org/.
In our current setting, “sore/sono (that)” is the most appropriate because the solver (speaker) does not control the mouse, i.e. the pointing device. Indeed, “sore (that)” is the most frequent demonstrative pronoun in the entire corpus.

References

Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G., Garrod, S., et al. (1991). The HCRC map task corpus. Language and Speech, 34(4), 351–366.
Belz, A., & Gatt, A. (2008). Intrinsic vs. extrinsic evaluation measures for referring expression generation. In Proceedings of ACL-08: HLT, Short Papers (pp. 197–200).
Belz, A., & Kow, E. (2010). The GREC challenges 2010: Overview and evaluation results. In Proceedings of the 6th international natural language generation conference (pp. 219–229).
Belz, A., Kow, E., Viethen, J., & Gatt, A. (2010). Referring expression generation in context: The GREC shared task evaluation challenges. In E. Krahmer & M. Theune (Eds.), Empirical methods in natural language generation (Vol. LNCS 5790, pp. 294–327). Berlin: Springer.
Bolt, R. A. (1980). Put-that-there: Voice and gesture at the graphics interface. In Proceedings of the 7th annual conference on computer graphics and interactive techniques (SIGGRAPH 1980) (pp. 262–270). ACM.
Byron, D., Koller, A., Striegnitz, K., Cassell, J., Dale, R., Moore, J., et al. (2009). Report on the first NLG challenge on generating instructions in virtual environments (GIVE). In Proceedings of the 12th European workshop on natural language generation (ENLG 2009) (pp. 165–173).
Cahill, A., & van Genabith, J. (2006). Robust PCFG-based generation using automatically acquired LFG approximations. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (pp. 1033–1040).
Campana, E., Tanenhaus, M. K., Allen, J. F., & Remington, R. (2011). Natural discourse reference generation reduces cognitive load in spoken systems. Natural Language Engineering, 17(3), 311–329.
Carenini, G., & Moore, J. D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11), 925–952.
Clark, H. H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 22, 1–39.
Dale, R. (1989). Cooking up referring expressions. In Proceedings of the 27th annual meeting of the association for computational linguistics (pp. 68–75).
Dale, R., & Reiter, E. (1995). Computational interpretation of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2), 233–263.
Di Eugenio, B., Glass, M., & Trolio, M. J. (2002). The DIAG experiments: Natural language generation for intelligent tutoring systems. In Proceedings of the 2nd international natural language generation conference (INLG 2002) (pp. 120–127).
Di Eugenio, B., Jordan, P. W., Thomason, R. H., & Moore, J. D. (2000). The agreement process: An empirical investigation of human-human computer-mediated collaborative dialogues. International Journal of Human-Computer Studies, 53(6), 1017–1076.
Foster, M. E., Giuliani, M., & Knoll, A. (2009). Comparing objective and subjective measures of usability in a human-robot dialogue system. In Proceedings of the 47th annual meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 879–887).
Gargett, A., Garoufi, K., Koller, A., & Striegnitz, K. (2010). The GIVE-2 corpus of giving instructions in virtual environments. In Proceedings of the seventh conference on international language resources and evaluation (LREC 2010) (pp. 2401–2406).
Gatt, A., & Belz, A. (2010). Introducing shared tasks to NLG: The TUNA shared task evaluation challenges. In E. Krahmer & M. Theune (Eds.), Empirical methods in natural language generation (Vol. LNAI 5790, pp. 264–293). Berlin: Springer.
Gupta, S., & Stent, A. J. (2005). Automatic evaluation of referring expression generation using corpora. In Proceedings of the 1st workshop on using Corpora in NLG.
Heeman, P. A., & Hirst, G. (1995). Collaborating on referring expressions. Computational Linguistics, 21(3), 351–382.
Horton, W. S., & Keysar, B. (1996). When do speakers take into account common ground? Cognition, 59, 91–117.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge: MIT Press.
Jordan, P. W., & Walker, M. A. (2005). Learning content selection rules for generating object descriptions in dialogue. Journal of Artificial Intelligence Research, 24, 157–194.
Khan, I., van Deemter, K., Ritchie, G., Gatt, A., & Cleland, A. A. (2009). A hearer-oriented evaluation of referring expression generation. In Proceedings of the 12th European workshop on natural language generation (ENLG 2009) (pp. 98–101).
Koller, A., Striegnitz, K., Gargett, A., Byron, D., Cassell, J., Dale, R., et al. (2010). Report on the second NLG challenge on generating instructions in virtual environments (GIVE-2). In Proceedings of the 6th international natural language generation conference (pp. 243–250).
Krahmer, E., & van Deemter, K. (2012). Computational generation of referring expressions: A survey. Computational Linguistics, 38(1), 173–218.
Lester, J. C., Voerman, J. L., Towns, S. G., & Callaway, C. B. (1999). Deictic believability: Coordinating gesture, locomotion, and speech in lifelike pedagogical agents. Applied Artificial Intelligence, 13(4–5), 383–414.
Mitkov, R. (2002). Anaphora resolution. London: Longman.
Ono, K. (1994). Territories of information and Japanese demonstratives. The Journal of the Association of Teachers of Japanese, 28(2), 131–155.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (ACL 2002) (pp. 311–318).
Paraboni, I., van Deemter, K., & Masthoff, J. (2007). Generating referring expressions: Making referents easy to identify. Computational Linguistics, 33(2), 229–254.
Reiter, E., & Belz, A. (2009). An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4), 529–558.
Reiter, E., Robertson, R., & Osman, L. M. (2003). Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144(1–2), 41–58.
Reiter, E., & Sripada, S. (2002). Should corpora texts be gold standards for NLG? In Proceedings of the 2nd international natural language generation conference (INLG 2002) (pp. 97–104).
Reiter, E., Sripada, S., Hunter, J., Yu, J., & Davy, I. (2005). Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1–2), 137–169.
Spanger, P., Iida, R., Tokunaga, T., Terai, A., & Kuriyama, N. (2010a). Towards an extrinsic evaluation of referring expressions in situated dialogs. In J. Kelleher, B. Mac Namee, & I. van der Sluis (Eds.), Proceedings of the sixth international natural language generation conference (INLG 2010) (pp. 135–144).
Spanger, P., Yasuhara, M., Iida, R., & Tokunaga, T. (2009). Using extra linguistic information for generating demonstrative pronouns in a situated collaboration task. In Proceedings of PreCogSci 2009: Production of referring expressions: Bridging the gap between computational and empirical approaches to reference.
Spanger, P., Yasuhara, M., Iida, R., Tokunaga, T., Terai, A., & Kuriyama, N. (2010b). REX-J: Japanese referring expression corpus of situated dialogs. Language Resources and Evaluation, 46(3), 461–491.
Sparck Jones, K., & Galliers, J. R. (1996). Evaluating natural language processing systems: An analysis and review. Berlin: Springer.
Stoia, L., Shockley, D. M., Byron, D. K., & Fosler-Lussier, E. (2006). Noun phrase generation for situated dialogs. In Proceedings of the 4th international natural language generation conference (INLG 2006) (pp. 81–88).
Striegnitz, K., Denis, A., Gargett, A., Garoufi, K., Koller, A., & Theune, M. (2011). Report on the second second challenge on generating instructions in virtual environments (GIVE-2.5). In Proceedings of the 13th European workshop on natural language generation (ENLG 2011) (pp. 270–297).
Tokunaga, T., Iida, R., Terai, A., & Kuriyama, N. (2012). The REX corpora: A collection of multimodal corpora of referring expressions in collaborative problem solving dialogues. In Proceedings of the eighth international conference on language resources and evaluation (LREC 2012) (pp. 422–429).
van Deemter, K., Gatt, A., van der Sluis, I., & Power, R. (2012). Generation of referring expressions: Assessing the incremental algorithm. Cognitive Science, 36(5), 799–836.
van der Sluis, I., Gatt, A., & van Deemter, K. (2007). Evaluating algorithms for the generation of referring expressions: Going beyond toy domains. In Proceedings of recent advances in natural language processing (RANLP 2007).
van der Sluis, I., & Krahmer, E. (2007). Generating multimodal references. Discourse Processes, 44(3), 145–174.
Vapnik, V. N. (1998). Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. New York: Wiley.
Young, R. M. (1999). Using Grice’s maxim of quantity to select the content of plan descriptions. Artificial Intelligence, 115, 215–256.

Author information

Authors and Affiliations

Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
Philipp Spanger, Ryu Iida & Takenobu Tokunaga
Department of Human System Science, Tokyo Institute of Technology, Tokyo, Japan
Asuka Terai & Naoko Kuriyama

Corresponding author

Correspondence to Takenobu Tokunaga.

About this article

Cite this article

Spanger, P., Iida, R., Tokunaga, T. et al. A task-performance evaluation of referring expressions in situated collaborative task dialogues. Lang Resources & Evaluation 47, 1285–1304 (2013). https://doi.org/10.1007/s10579-013-9240-5

Published: 21 June 2013
Issue Date: December 2013
DOI: https://doi.org/10.1007/s10579-013-9240-5
