ABSTRACT
Conversational agents are drawing increasing attention in the information retrieval (IR) community, thanks in part to the advances in language understanding enabled by large contextualized language models. IR researchers have long recognized the importance of a sound evaluation of new approaches. Yet, the development of evaluation techniques for conversational search remains an overlooked problem. Currently, most evaluation approaches rely on procedures drawn directly from ad-hoc search evaluation, treating the utterances in a conversation as independent events, as if they were separate topics, instead of accounting for the conversational context. We overcome this issue by proposing a framework for defining evaluation measures that are aware of the conversational context and the semantic dependencies among utterances. In particular, we model conversations as Directed Acyclic Graphs (DAGs), where self-explanatory utterances are root nodes, while anaphoric utterances are linked to the utterances that contain their missing semantic information. We then propose a family of hierarchical dependence-aware aggregations of evaluation metrics, driven by the conversational graph. In our experiments, we show that utterances from the same conversation are 20% more correlated than utterances from different conversations. Thanks to the proposed framework, we can include this correlation in our aggregations and determine more accurately which pairs of conversational systems are significantly different.
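The dependency structure described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes each conversation is given as a DAG whose roots are self-explanatory utterances and whose edges point from an anaphoric utterance to the utterance supplying its missing context, and it uses a hypothetical depth-based discount (`decay ** depth`) as one plausible instance of a dependence-aware aggregation. The function names, the decay parameter, and the weighting scheme are illustrative assumptions.

```python
# Illustrative sketch only: a conversation modeled as a DAG of utterance
# dependencies, with a hypothetical depth-discounted aggregation of
# per-utterance evaluation scores (e.g., nDCG per utterance).
from collections import defaultdict


def depths(edges, roots):
    """BFS depth of each utterance in the dependency DAG.

    edges: list of (anaphoric_utterance, context_utterance) pairs,
           i.e., child depends on parent.
    roots: self-explanatory utterances (depth 0).
    """
    children = defaultdict(list)
    for child, parent in edges:
        children[parent].append(child)
    depth = {r: 0 for r in roots}
    frontier = list(roots)
    while frontier:
        nxt = []
        for node in frontier:
            for c in children[node]:
                if c not in depth:
                    depth[c] = depth[node] + 1
                    nxt.append(c)
        frontier = nxt
    return depth


def dependence_aware_score(scores, edges, roots, decay=0.8):
    """Weighted mean of per-utterance scores, weight = decay ** depth.

    Deeper (more context-dependent) utterances contribute less; this is
    one assumed instantiation of a hierarchical aggregation, not the
    paper's exact formula.
    """
    d = depths(edges, roots)
    weights = {u: decay ** d[u] for u in scores}
    return sum(weights[u] * scores[u] for u in scores) / sum(weights.values())


# Toy conversation: u1 is self-explanatory; u2 and u3 are anaphoric and
# depend on u1 for their missing semantic information.
scores = {"u1": 0.9, "u2": 0.5, "u3": 0.7}
edges = [("u2", "u1"), ("u3", "u1")]
print(round(dependence_aware_score(scores, edges, ["u1"]), 3))  # → 0.715
```

In this toy example, the two anaphoric utterances are down-weighted by the decay factor, so the aggregated conversation score leans toward the self-explanatory root, reflecting the intuition that an utterance's contribution should be read in light of its position in the dependency graph.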
Index Terms
- Hierarchical Dependence-aware Evaluation Measures for Conversational Search