Abstract
Appropriate evaluation of referring expressions is critical for the design of systems that can effectively collaborate with humans. A widely used method is to evaluate the degree to which an algorithm can reproduce the expressions found in previously collected corpora. Several researchers, however, have noted the need for a task-performance evaluation that measures how effectively a referring expression contributes to achieving a given task goal. This is particularly important in collaborative situated dialogues. Using referring expressions produced by six pairs of Japanese speakers collaboratively solving Tangram puzzles, we conducted a task-performance evaluation of referring expressions with 36 human evaluators. In particular, we focused on the evaluation of demonstrative pronouns generated by a machine learning-based algorithm. Comparing the results of this task-performance evaluation with those of a previously conducted corpus-matching evaluation (Spanger et al. in Lang Resour Eval, 2010b), we confirm the limitations of corpus-matching evaluation and discuss the need for task-performance evaluation.
Notes
They compared the system output with subjective human judgement as a gold standard rather than an existing corpus.
In this methodology subjects are requested to perform different kinds of tasks simultaneously, one of which is the target task to measure the cognitive load.
The corpus is publicly available together with other variants as the REX corpora (Tokunaga et al. 2012) through GSK (http://www.gsk.or.jp/index_e.html) (Resource ID: GSK2013-A).
There are three types of demonstrative pronoun/adjective in Japanese: “kore/kono (this)”, “sore/sono (that)” and “are/ano (that)”. They are basically chosen based on the physical and mental distance between the speaker and the target (Ono 1994).
In our current setting, “sore/sono (that)” is the most appropriate because the solver (speaker) does not have control of the mouse, i.e. the pointing device. In fact, “sore (that)” is the most frequent demonstrative pronoun in the entire corpus.
References
Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G., Garrod, S., et al. (1991). The HCRC map task corpus. Language and Speech, 34(4), 351–366.
Belz, A., & Gatt, A. (2008). Intrinsic vs. extrinsic evaluation measures for referring expression generation. In Proceedings of ACL-08: HLT, Short Papers (pp. 197–200).
Belz, A., & Kow, E. (2010). The GREC challenges 2010: Overview and evaluation results. In Proceedings of the 6th international natural language generation conference (pp. 219–229).
Belz, A., Kow, E., Viethen, J., & Gatt, A. (2010). Referring expression generation in context: The GREC shared task evaluation challenges. In E. Krahmer, & M. Theune (Eds.), Empirical methods in natural language generation (Vol. LNCS 5790, pp. 294–327). Berlin: Springer.
Bolt, R. A. (1980). Put-that-there: Voice and gesture at the graphics interface. In Proceedings of the 7th annual conference on computer graphics and interactive techniques (SIGGRAPH 1980) (pp. 262–270). ACM.
Byron, D., Koller, A., Striegnitz, K., Cassell, J., Dale, R., Moore, J., et al. (2009). Report on the first NLG challenge on generating instructions in virtual environments (GIVE). In Proceedings of the 12th European workshop on natural language generation (ENLG 2009) (pp. 165–173).
Cahill, A., & van Genabith, J. (2006). Robust PCFG-based generation using automatically acquired LFG approximations. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (pp. 1033–1040).
Campana, E., Tanenhaus, M. K., Allen, J. F., & Remington, R. (2011). Natural discourse reference generation reduces cognitive load in spoken systems. Natural Language Engineering, 17(3), 311–329.
Carenini, G., & Moore, J. D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11), 925–952.
Clark, H. H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 22, 1–39.
Dale, R. (1989). Cooking up referring expressions. In Proceedings of the 27th annual meeting of the association for computational linguistics (pp. 68–75).
Dale, R., & Reiter, E. (1995). Computational interpretation of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2), 233–263.
Di Eugenio, B., Glass, M., & Trolio, M. J. (2002). The DIAG experiments: Natural language generation for intelligent tutoring systems. In Proceedings of the 2nd international natural language generation conference (INLG 2002) (pp. 120–127).
Di Eugenio, B., Jordan, P. W., Thomason, R. H., & Moore, J. D. (2000). The agreement process: An empirical investigation of human-human computer-mediated collaborative dialogues. International Journal of Human-Computer Studies, 53(6), 1017–1076.
Foster, M. E., Giuliani, M., & Knoll, A. (2009). Comparing objective and subjective measures of usability in a human-robot dialogue system. In Proceedings of the 47th annual meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 879–887).
Gargett, A., Garoufi, K., Koller, A., & Striegnitz, K. (2010). The GIVE-2 corpus of giving instructions in virtual environments. In Proceedings of the seventh conference on international language resources and evaluation (LREC 2010) (pp. 2401–2406).
Gatt, A., & Belz, A. (2010). Introducing shared tasks to NLG: The TUNA shared task evaluation challenges. In E. Krahmer, & M. Theune (Eds.), Empirical methods in natural language generation (Vol. LNAI 5790, pp. 264–293). Berlin: Springer.
Gupta, S., & Stent, A. J. (2005). Automatic evaluation of referring expression generation using corpora. In Proceedings of the 1st workshop on using Corpora in NLG.
Heeman, P. A., & Hirst, G. (1995). Collaborating on referring expressions. Computational Linguistics, 21(3), 351–382.
Horton, W. S., & Keysar, B. (1996). When do speakers take into account common ground? Cognition, 59, 91–117.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 169–184). Cambridge: MIT Press.
Jordan, P. W., & Walker, M. A. (2005). Learning content selection rules for generating object descriptions in dialogue. Journal of Artificial Intelligence Research, 24, 157–194.
Khan, I., van Deemter, K., Ritchie, G., Gatt, A., & Cleland, A. A. (2009). A hearer-oriented evaluation of referring expression generation. In Proceedings of the 12th European workshop on natural language generation (ENLG 2009) (pp. 98–101).
Koller, A., Striegnitz, K., Gargett, A., Byron, D., Cassell, J., Dale, R., et al. (2010). Report on the second NLG challenge on generating instructions in virtual environments (GIVE-2). In Proceedings of the 6th international natural language generation conference (pp. 243–250).
Krahmer, E., & van Deemter, K. (2012). Computational generation of referring expressions: A survey. Computational Linguistics, 38(1), 173–218.
Lester, J. C., Voerman, J. L., Towns, S. G., & Callaway, C. B. (1999). Deictic believability: Coordinating gesture, locomotion, and speech in lifelike pedagogical agents. Applied Artificial Intelligence, 13(4–5), 383–414.
Mitkov, R. (2002). Anaphora resolution. London: Longman.
Ono, K. (1994). Territories of information and Japanese demonstratives. The Journal of the Association of Teachers of Japanese, 28(2), 131–155.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (ACL 2002) (pp. 311–318).
Paraboni, I., van Deemter, K., & Masthoff, J. (2007). Generating referring expressions: Making referents easy to identify. Computational Linguistics, 33(2), 229–254.
Reiter, E., & Belz, A. (2009). An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4), 529–558.
Reiter, E., Robertson, R., & Osman, L. M. (2003). Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144(1–2), 41–58.
Reiter, E., & Sripada, S. (2002). Should corpora texts be gold standards for NLG? In Proceedings of the 2nd international natural language generation conference (INLG 2002) (pp. 97–104).
Reiter, E., Sripada, S., Hunter, J., Yu, J., & Davy, I. (2005). Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1–2), 137–169.
Spanger, P., Iida, R., Tokunaga, T., Terai, A., & Kuriyama, N. (2010a). Towards an extrinsic evaluation of referring expressions in situated dialogs. In J. Kelleher, B. Mac Namee, & I. van der Sluis (Eds.), Proceedings of the sixth international natural language generation conference (INLG 2010) (pp. 135–144).
Spanger, P., Yasuhara, M., Iida, R., & Tokunaga, T. (2009). Using extra linguistic information for generating demonstrative pronouns in a situated collaboration task. In Proceedings of PreCogSci 2009: Production of referring expressions: Bridging the gap between computational and empirical approaches to reference.
Spanger, P., Yasuhara, M., Iida, R., Tokunaga, T., Terai, A., & Kuriyama, N. (2010b). REX-J: Japanese referring expression corpus of situated dialogs. Language Resources and Evaluation, 46(3), 461–491.
Sparck Jones, K., & Galliers, J. R. (1996). Evaluating natural language processing systems: An analysis and review. Berlin: Springer.
Stoia, L., Shockley, D. M., Byron, D. K., & Fosler-Lussier, E. (2006). Noun phrase generation for situated dialogs. In Proceedings of the 4th international natural language generation conference (INLG 2006) (pp. 81–88).
Striegnitz, K., Denis, A., Gargett, A., Garoufi, K., Koller, A., & Theune, M. (2011). Report on the second second challenge on generating instructions in virtual environments (GIVE-2.5). In Proceedings of the 13th European workshop on natural language generation (ENLG 2011) (pp. 270–297).
Tokunaga, T., Iida, R., Terai, A., & Kuriyama, N. (2012). The REX corpora: A collection of multimodal corpora of referring expressions in collaborative problem solving dialogues. In Proceedings of the eighth international conference on language resources and evaluation (LREC 2012) (pp. 422–429).
van Deemter, K., Gatt, A., van der Sluis, I., & Power, R. (2012). Generation of referring expressions: Assessing the incremental algorithm. Cognitive Science, 36(5), 799–836.
van der Sluis, I., Gatt, A., & van Deemter, K. (2007). Evaluating algorithms for the generation of referring expressions: Going beyond toy domains. In Proceedings of recent advances in natural language processing (RANLP 2007).
van der Sluis, I., & Krahmer, E. (2007). Generating multimodal references. Discourse Processes, 44(3), 145–174.
Vapnik, V. N. (1998). Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. New York: Wiley.
Young, R. M. (1999). Using Grice’s maxim of quantity to select the content of plan descriptions. Artificial Intelligence, 115, 215–256.
About this article
Cite this article
Spanger, P., Iida, R., Tokunaga, T. et al. A task-performance evaluation of referring expressions in situated collaborative task dialogues. Lang Resources & Evaluation 47, 1285–1304 (2013). https://doi.org/10.1007/s10579-013-9240-5
DOI: https://doi.org/10.1007/s10579-013-9240-5