DOI: 10.1145/3404835.3463090

Hierarchical Dependence-aware Evaluation Measures for Conversational Search

Published: 11 July 2021

ABSTRACT

Conversational agents are drawing considerable attention in the information retrieval (IR) community, thanks in part to the advances in language understanding enabled by large contextualized language models. IR researchers have long recognized the importance of a sound evaluation of new approaches. Yet, the development of evaluation techniques for conversational search remains an overlooked problem. Currently, most evaluation approaches rely on procedures drawn directly from ad-hoc search evaluation, treating the utterances in a conversation as independent events, as if they were separate topics, instead of accounting for the conversational context. We overcome this issue by proposing a framework for defining evaluation measures that are aware of the conversational context and the semantic dependencies between utterances. In particular, we model conversations as Directed Acyclic Graphs (DAGs), where self-explanatory utterances are root nodes, while anaphoric utterances are linked to the utterances that contain their missing semantic information. We then propose a family of hierarchical dependence-aware aggregations of evaluation metrics, driven by the conversational graph. In our experiments, we show that utterances from the same conversation are 20% more correlated than utterances from different conversations. Thanks to the proposed framework, we are able to include such correlation in our aggregations and to be more accurate when determining which pairs of conversational systems are significantly different.
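To make the modeling concrete, below is a minimal sketch of the idea described in the abstract: a conversation as a DAG whose roots are self-explanatory utterances, with anaphoric utterances attached to the turns carrying their missing semantic information, and per-utterance scores aggregated hierarchically over the graph. The `Utterance` class, the `decay` parameter, and the specific aggregation rule are illustrative assumptions for exposition, not the paper's actual family of measures.

```python
# Illustrative sketch only: the Utterance class, the decay parameter, and the
# aggregation rule are assumptions, not the authors' implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Utterance:
    """One conversational turn with its per-utterance score (e.g., nDCG)."""
    uid: str
    score: float
    children: List["Utterance"] = field(default_factory=list)  # anaphoric dependents


def aggregate(node: Utterance, decay: float = 0.5) -> float:
    """Aggregate scores hierarchically over the conversation graph.

    A self-explanatory (root) utterance contributes its own score plus a
    decayed average of its dependents' aggregated scores, so the evaluation
    of a turn also reflects the turns that semantically depend on it.
    """
    if not node.children:
        return node.score
    child_scores = [aggregate(child, decay) for child in node.children]
    return node.score + decay * sum(child_scores) / len(child_scores)


# Toy conversation: u1 is self-explanatory; u2 resolves against u1,
# and u3 resolves against u2.
u3 = Utterance("u3", score=0.4)
u2 = Utterance("u2", score=0.7, children=[u3])
u1 = Utterance("u1", score=0.9, children=[u2])

print(f"Dependence-aware conversation score: {aggregate(u1):.3f}")  # 1.350
```

In this sketch a mistake on an early, self-explanatory turn depresses the aggregated score of every turn that depends on it, which is the intuition behind evaluating utterances in context rather than as independent topics.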


Published in

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2021, 2998 pages
ISBN: 9781450380379
DOI: 10.1145/3404835

      Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



      Qualifiers

      • short-paper

      Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
