DOI: 10.1145/2766462.2767695

Predicting Search Satisfaction Metrics with Interleaved Comparisons

Published: 09 August 2015

Abstract

The gold standard for online retrieval evaluation is AB testing. Rooted in the idea of a controlled experiment, AB tests compare the performance of an experimental system (treatment) on one sample of the user population to that of a baseline system (control) on another sample. Given an online evaluation metric that accurately reflects user satisfaction, these tests enjoy high validity. However, due to the high variance across users, these comparisons often have low sensitivity, requiring millions of queries to detect statistically significant differences between systems. Interleaving is an alternative online evaluation approach, where each user is presented with a combination of results from both the control and treatment systems. Compared to AB tests, interleaving has been shown to be substantially more sensitive. However, interleaving methods have so far focused on user clicks only, and lack support for the more sophisticated user satisfaction metrics used in AB testing. In this paper we present the first method for integrating user satisfaction metrics with interleaving. We show (1) how interleaving can be extended to directly match the user signals and parameters of AB metrics, and (2) how parameterized interleaving credit functions can be automatically calibrated to predict AB outcomes. We also develop a new method for estimating the relative sensitivity of interleaving and AB metrics, and show that our interleaving credit functions improve agreement with AB metrics without sacrificing sensitivity. Our results, using 38 large-scale online experiments encompassing over 3 billion clicks in a web search setting, demonstrate up to a 22% improvement in agreement with AB metrics (constituting over a 50% error reduction), while maintaining sensitivity one to two orders of magnitude above that of the AB tests. This paves the way towards more sensitive and accurate online evaluation.
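
To make the mechanics concrete, the following sketch (not taken from the paper) shows standard team-draft interleaving combined with a hypothetical parameterized credit function that weights each click by a constant term plus capped dwell time. In the framework described above, such parameters would be calibrated so that the aggregated credit predicts AB metric outcomes; the function names, the (w_click, w_dwell) weights, and the 300-second dwell cap are illustrative assumptions, not the authors' implementation.

# Minimal sketch: team-draft interleaving with a hypothetical parameterized
# credit function. Illustration only; not the paper's implementation.
import random


def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Team-draft interleaving: in each round, A and B (in random order)
    contribute their highest-ranked result not yet shown."""
    interleaved, team = [], {}
    pos = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(interleaved) < length and (
        pos["A"] < len(ranking_a) or pos["B"] < len(ranking_b)
    ):
        for side in random.sample(["A", "B"], 2):  # random pick order per round
            ranking, i = rankings[side], pos[side]
            while i < len(ranking) and ranking[i] in team:
                i += 1  # skip results already contributed by the other side
            if i < len(ranking) and len(interleaved) < length:
                doc = ranking[i]
                interleaved.append(doc)
                team[doc] = side
                i += 1
            pos[side] = i
    return interleaved, team


def click_credit(click, params):
    """Hypothetical parameterized credit: a weighted mix of a constant click
    reward and capped dwell time, mimicking satisfaction-style AB signals.
    The weights (w_click, w_dwell) are what would be calibrated."""
    w_click, w_dwell = params
    return w_click + w_dwell * min(click["dwell_seconds"], 300) / 300.0


def interleaving_outcome(clicks, team, params):
    """Aggregate credit per side; > 0 favours system A, < 0 favours system B."""
    score = {"A": 0.0, "B": 0.0}
    for click in clicks:
        side = team.get(click["doc"])
        if side is not None:
            score[side] += click_credit(click, params)
    return score["A"] - score["B"]


if __name__ == "__main__":
    a = ["d1", "d2", "d3", "d4"]
    b = ["d3", "d1", "d5", "d6"]
    shown, team = team_draft_interleave(a, b, length=4)
    clicks = [{"doc": shown[0], "dwell_seconds": 45}]
    print(shown, interleaving_outcome(clicks, team, params=(1.0, 0.5)))

Calibration would then amount to fitting (w_click, w_dwell) so that the interleaving outcome agrees with the corresponding AB metric difference across past experiments, in the spirit of the credit-function calibration described in the abstract.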




    Published In

    SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2015
    1198 pages
    ISBN:9781450336215
    DOI:10.1145/2766462
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. evaluation
    2. information retrieval
    3. interleaved comparisons

    Qualifiers

    • Research-article

    Conference

    SIGIR '15

    Acceptance Rates

    SIGIR '15 Paper Acceptance Rate: 70 of 351 submissions, 20%
    Overall Acceptance Rate: 792 of 3,983 submissions, 20%


    Article Metrics

    • Downloads (last 12 months): 10
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 16 Feb 2025


    Cited By

    • (2024) Navigating the Evaluation Funnel to Optimize Iteration Speed for Recommender Systems. Proceedings of the Future Technologies Conference (FTC) 2024, Volume 1, pp. 138-157. DOI: 10.1007/978-3-031-73110-5_11. Online publication date: 5-Nov-2024.
    • (2023) Interleaved Online Testing in Large-Scale Systems. Companion Proceedings of the ACM Web Conference 2023, pp. 921-926. DOI: 10.1145/3543873.3587572. Online publication date: 30-Apr-2023.
    • (2023) Stat-Weight: Improving the Estimator of Interleaved Methods Outcomes with Statistical Hypothesis Testing. Advances in Information Retrieval, pp. 20-34. DOI: 10.1007/978-3-031-28241-6_2. Online publication date: 16-Mar-2023.
    • (2022) Understanding and Evaluating Search Experience. Synthesis Lectures on Information Concepts, Retrieval, and Services, 14:1, pp. 1-105. DOI: 10.2200/S01166ED1V01Y202202ICR077. Online publication date: 28-Mar-2022.
    • (2022) Debiased Balanced Interleaving at Amazon Search. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 2913-2922. DOI: 10.1145/3511808.3557123. Online publication date: 17-Oct-2022.
    • (2022) Ranking Task in RAS: A Comparative Study of Learning to Rank Algorithms and Interleaving Methods. Digital Technologies and Applications, pp. 158-168. DOI: 10.1007/978-3-031-01942-5_16. Online publication date: 8-May-2022.
    • (2021) Decomposition and Interleaving for Variance Reduction of Post-click Metrics. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 221-230. DOI: 10.1145/3471158.3472235. Online publication date: 11-Jul-2021.
    • (2021) Towards the D-Optimal Online Experiment Design for Recommender Selection. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 3817-3825. DOI: 10.1145/3447548.3467192. Online publication date: 14-Aug-2021.
    • (2020) Dueling Bandits. Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 348-356. DOI: 10.5555/3398761.3398806. Online publication date: 5-May-2020.
    • (2019) Continuous Evaluation of Large-Scale Information Access Systems: A Case for Living Labs. Information Retrieval Evaluation in a Changing World, pp. 511-543. DOI: 10.1007/978-3-030-22948-1_21. Online publication date: 14-Aug-2019.
