ABSTRACT
Conversational agents are drawing increasing attention in the information retrieval (IR) community, thanks in part to the advances in language understanding enabled by large contextualized language models. IR researchers have long recognized the importance of a sound evaluation of new approaches. Yet, the development of evaluation techniques for conversational search remains an overlooked problem. Currently, most evaluation approaches rely on procedures drawn directly from ad-hoc search evaluation, treating the utterances in a conversation as independent events, as if they were separate topics, instead of accounting for the conversational context. We overcome this issue by proposing a framework for defining evaluation measures that are aware of the conversational context and the semantic dependencies among utterances. In particular, we model conversations as Directed Acyclic Graphs (DAGs), where self-explanatory utterances are root nodes, while anaphoric utterances are linked to the utterances that contain their missing semantic information. We then propose a family of hierarchical dependence-aware aggregations of evaluation metrics, driven by the conversational graph. In our experiments, we show that utterances from the same conversation are 20% more correlated than utterances from different conversations. Thanks to the proposed framework, we can include this correlation in our aggregations and determine more accurately which pairs of conversational systems are significantly different.
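The dependency structure described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes each conversation is given as a DAG whose roots are self-explanatory utterances and whose edges point from an anaphoric utterance to the utterance supplying its missing context, and it uses a hypothetical depth-based discount (`decay ** depth`) as one plausible instance of a dependence-aware aggregation. The function names, the decay parameter, and the weighting scheme are illustrative assumptions.

```python
# Illustrative sketch only: a conversation modeled as a DAG of utterance
# dependencies, with a hypothetical depth-discounted aggregation of
# per-utterance evaluation scores (e.g., nDCG per utterance).
from collections import defaultdict


def depths(edges, roots):
    """BFS depth of each utterance in the dependency DAG.

    edges: list of (anaphoric_utterance, context_utterance) pairs,
           i.e., child depends on parent.
    roots: self-explanatory utterances (depth 0).
    """
    children = defaultdict(list)
    for child, parent in edges:
        children[parent].append(child)
    depth = {r: 0 for r in roots}
    frontier = list(roots)
    while frontier:
        nxt = []
        for node in frontier:
            for c in children[node]:
                if c not in depth:
                    depth[c] = depth[node] + 1
                    nxt.append(c)
        frontier = nxt
    return depth


def dependence_aware_score(scores, edges, roots, decay=0.8):
    """Weighted mean of per-utterance scores, weight = decay ** depth.

    Deeper (more context-dependent) utterances contribute less; this is
    one assumed instantiation of a hierarchical aggregation, not the
    paper's exact formula.
    """
    d = depths(edges, roots)
    weights = {u: decay ** d[u] for u in scores}
    return sum(weights[u] * scores[u] for u in scores) / sum(weights.values())


# Toy conversation: u1 is self-explanatory; u2 and u3 are anaphoric and
# depend on u1 for their missing semantic information.
scores = {"u1": 0.9, "u2": 0.5, "u3": 0.7}
edges = [("u2", "u1"), ("u3", "u1")]
print(round(dependence_aware_score(scores, edges, ["u1"]), 3))  # → 0.715
```

In this toy example, the two anaphoric utterances are down-weighted by the decay factor, so the aggregated conversation score leans toward the self-explanatory root, reflecting the intuition that an utterance's contribution should be read in light of its position in the dependency graph.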
Index Terms
- Hierarchical Dependence-aware Evaluation Measures for Conversational Search