Beyond Success Rate: Utility as a Search Quality Metric for Online Experiments

ABSTRACT
User satisfaction metrics are an integral part of search engine development, as they help system developers understand and evaluate the quality of the user experience. Research to date has mostly focused on predicting success or frustration as a proxy for satisfaction. However, users' search experience is more complex than merely being either successful or not, so using success rate as a measure of satisfaction can be limiting. In this work, we propose the use of utility as a measure of searcher satisfaction. This concept represents the fulfillment a user receives from consuming a service and explains how users aim to gain optimal overall satisfaction. Our utility metrics measure user satisfaction by aggregating all of the user's interactions with the search engine. These interactions are represented as a timeline of actions and their dwell times, where each action is classified as having a positive or negative effect on the user. We examine sessions mined from Bing logs, with multi-point-scale assessments of searcher satisfaction, and show that utility is a better proxy for satisfaction than success. Leveraging that data, we design metrics of searcher satisfaction that assess the overall utility accumulated by a user during her search session. We use real user traffic from millions of users in an A/B setting to compare utility metrics to success rate metrics. We show that utility is a better metric for evaluating searcher satisfaction with the search engine, and a more sensitive and accurate metric than predicting success. These metrics are currently adopted as the top-level metric for evaluating the thousands of A/B experiments that are run on Bing each year.
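The abstract describes a session as a timeline of actions with dwell times, each classified as having a positive or negative effect, which are then aggregated into an overall utility score. The following is a minimal illustrative sketch of that idea; the action types, the sign assignments, and the dwell-weighted sum are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of a session-utility metric: each action in a
# session timeline carries a sign (positive or negative effect on the
# user) and a dwell time; session utility is the signed, dwell-weighted
# aggregate. Action kinds, thresholds, and the aggregation rule are
# illustrative assumptions, not the paper's actual definitions.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "quickback", "reformulation"
    dwell_secs: float  # time spent on this action

def sign(action: Action) -> int:
    """Assumed classification: long-dwell clicks are positive,
    quick-backs and reformulations are negative, everything else neutral."""
    if action.kind == "click" and action.dwell_secs >= 30:
        return +1
    if action.kind in ("quickback", "reformulation"):
        return -1
    return 0

def session_utility(timeline: list[Action]) -> float:
    """Aggregate signed dwell time over the whole session."""
    return sum(sign(a) * a.dwell_secs for a in timeline)

session = [
    Action("click", 45.0),      # satisfying click
    Action("quickback", 5.0),   # bounced back quickly
    Action("click", 120.0),     # long, satisfying read
]
print(session_utility(session))  # 45 - 5 + 120 = 160.0
```

Under this sketch, a session with one frustrating quick-back can still score well overall, which is the key contrast with a binary success/failure label.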