DOI: 10.1145/2766462.2767695

Predicting Search Satisfaction Metrics with Interleaved Comparisons

Published: 09 August 2015

Abstract

The gold standard for online retrieval evaluation is AB testing. Rooted in the idea of a controlled experiment, AB tests compare the performance of an experimental system (treatment) on one sample of the user population to that of a baseline system (control) on another sample. Given an online evaluation metric that accurately reflects user satisfaction, these tests enjoy high validity. However, due to the high variance across users, these comparisons often have low sensitivity, requiring millions of queries to detect statistically significant differences between systems. Interleaving is an alternative online evaluation approach, where each user is presented with a combination of results from both the control and treatment systems. Compared to AB tests, interleaving has been shown to be substantially more sensitive. However, interleaving methods have so far focused on user clicks only, and lack support for the more sophisticated user satisfaction metrics used in AB testing. In this paper we present the first method for integrating user satisfaction metrics with interleaving. We show (1) how interleaving can be extended to directly match the user signals and parameters of AB metrics, and (2) how parameterized interleaving credit functions can be automatically calibrated to predict AB outcomes. We also develop a new method for estimating the relative sensitivity of interleaving and AB metrics, and show that our interleaving credit functions improve agreement with AB metrics without sacrificing sensitivity. Our results, using 38 large-scale online experiments encompassing over 3 billion clicks in a web search setting, demonstrate up to a 22% improvement in agreement with AB metrics (constituting over a 50% error reduction), while maintaining sensitivity one to two orders of magnitude above that of the AB tests. This paves the way towards more sensitive and accurate online evaluation.
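
To make the mechanics concrete, the following sketch (not taken from the paper) shows standard team-draft interleaving combined with a hypothetical parameterized credit function that weights each click by a constant term plus capped dwell time. In the framework described above, such parameters would be calibrated so that the aggregated credit predicts AB metric outcomes; the function names, the (w_click, w_dwell) weights, and the 300-second dwell cap are illustrative assumptions, not the authors' implementation.

# Minimal sketch: team-draft interleaving with a hypothetical parameterized
# credit function. Illustration only; not the paper's implementation.
import random


def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Team-draft interleaving: in each round, A and B (in random order)
    contribute their highest-ranked result not yet shown."""
    interleaved, team = [], {}
    pos = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(interleaved) < length and (
        pos["A"] < len(ranking_a) or pos["B"] < len(ranking_b)
    ):
        for side in random.sample(["A", "B"], 2):  # random pick order per round
            ranking, i = rankings[side], pos[side]
            while i < len(ranking) and ranking[i] in team:
                i += 1  # skip results already contributed by the other side
            if i < len(ranking) and len(interleaved) < length:
                doc = ranking[i]
                interleaved.append(doc)
                team[doc] = side
                i += 1
            pos[side] = i
    return interleaved, team


def click_credit(click, params):
    """Hypothetical parameterized credit: a weighted mix of a constant click
    reward and capped dwell time, mimicking satisfaction-style AB signals.
    The weights (w_click, w_dwell) are what would be calibrated."""
    w_click, w_dwell = params
    return w_click + w_dwell * min(click["dwell_seconds"], 300) / 300.0


def interleaving_outcome(clicks, team, params):
    """Aggregate credit per side; > 0 favours system A, < 0 favours system B."""
    score = {"A": 0.0, "B": 0.0}
    for click in clicks:
        side = team.get(click["doc"])
        if side is not None:
            score[side] += click_credit(click, params)
    return score["A"] - score["B"]


if __name__ == "__main__":
    a = ["d1", "d2", "d3", "d4"]
    b = ["d3", "d1", "d5", "d6"]
    shown, team = team_draft_interleave(a, b, length=4)
    clicks = [{"doc": shown[0], "dwell_seconds": 45}]
    print(shown, interleaving_outcome(clicks, team, params=(1.0, 0.5)))

Calibration would then amount to fitting (w_click, w_dwell) so that the interleaving outcome agrees with the corresponding AB metric difference across past experiments, in the spirit of the credit-function calibration described in the abstract.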




    Published In

    SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2015
    1198 pages
    ISBN:9781450336215
    DOI:10.1145/2766462
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. evaluation
    2. information retrieval
    3. interleaved comparisons

    Qualifiers

    • Research-article

    Conference

    SIGIR '15

    Acceptance Rates

    SIGIR '15 Paper Acceptance Rate: 70 of 351 submissions, 20%
    Overall Acceptance Rate: 792 of 3,983 submissions, 20%


    Article Metrics

    • Downloads (last 12 months): 10
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 16 Feb 2025


    Cited By

    • (2024) Navigating the Evaluation Funnel to Optimize Iteration Speed for Recommender Systems. Proceedings of the Future Technologies Conference (FTC) 2024, Volume 1, pp. 138-157. DOI: 10.1007/978-3-031-73110-5_11. Online publication date: 5-Nov-2024.
    • (2023) Interleaved Online Testing in Large-Scale Systems. Companion Proceedings of the ACM Web Conference 2023, pp. 921-926. DOI: 10.1145/3543873.3587572. Online publication date: 30-Apr-2023.
    • (2023) Stat-Weight: Improving the Estimator of Interleaved Methods Outcomes with Statistical Hypothesis Testing. Advances in Information Retrieval, pp. 20-34. DOI: 10.1007/978-3-031-28241-6_2. Online publication date: 16-Mar-2023.
    • (2022) Understanding and Evaluating Search Experience. Synthesis Lectures on Information Concepts, Retrieval, and Services, 14:1, pp. 1-105. DOI: 10.2200/S01166ED1V01Y202202ICR077. Online publication date: 28-Mar-2022.
    • (2022) Debiased Balanced Interleaving at Amazon Search. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 2913-2922. DOI: 10.1145/3511808.3557123. Online publication date: 17-Oct-2022.
    • (2022) Ranking Task in RAS: A Comparative Study of Learning to Rank Algorithms and Interleaving Methods. Digital Technologies and Applications, pp. 158-168. DOI: 10.1007/978-3-031-01942-5_16. Online publication date: 8-May-2022.
    • (2021) Decomposition and Interleaving for Variance Reduction of Post-click Metrics. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 221-230. DOI: 10.1145/3471158.3472235. Online publication date: 11-Jul-2021.
    • (2021) Towards the D-Optimal Online Experiment Design for Recommender Selection. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 3817-3825. DOI: 10.1145/3447548.3467192. Online publication date: 14-Aug-2021.
    • (2020) Dueling Bandits. Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 348-356. DOI: 10.5555/3398761.3398806. Online publication date: 5-May-2020.
    • (2019) Continuous Evaluation of Large-Scale Information Access Systems: A Case for Living Labs. Information Retrieval Evaluation in a Changing World, pp. 511-543. DOI: 10.1007/978-3-030-22948-1_21. Online publication date: 14-Aug-2019.
