Research Article
DOI: 10.1145/3018661.3018708

Learning Sensitive Combinations of A/B Test Metrics

Published: 02 February 2017

Abstract

Online search evaluation, and A/B testing in particular, is an irreplaceable tool for modern search engines. Typically, online experiments last for several days or weeks and require a considerable portion of the search traffic. This restricts their usefulness and applicability.
To alleviate the need for large sample sizes in A/B experiments, several approaches have been proposed. Primarily, these approaches are based on increasing the sensitivity (informally, the ability to detect changes with fewer observations) of the evaluation metrics. Such sensitivity improvements are achieved by applying variance reduction methods, e.g. stratification and control covariates. However, the ability to learn sensitive metric combinations that (a) agree with the ground-truth metric, and (b) are more sensitive, has not been explored in the A/B testing scenario.
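As a point of reference for the variance-reduction methods mentioned above, the following is a minimal sketch of a control-covariate (CUPED-style) adjustment on synthetic data. It is not taken from the paper; the variable names, simulated effect size, and covariate model are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000                                   # users per arm (synthetic)

# Pre-experiment covariate and in-experiment metric for control + treatment users.
pre = rng.normal(10.0, 2.0, size=2 * n)
effect = np.r_[np.zeros(n), np.full(n, 0.05)]                  # small true treatment effect
metric = 0.8 * pre + rng.normal(0.0, 1.0, size=2 * n) + effect
group = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]   # 0 = control, 1 = treatment

# Control-covariate adjustment: theta = cov(metric, pre) / var(pre).
theta = np.cov(metric, pre)[0, 1] / np.var(pre, ddof=1)
adjusted = metric - theta * (pre - pre.mean())

for name, y in [("raw metric", metric), ("adjusted", adjusted)]:
    t, p = stats.ttest_ind(y[group == 1], y[group == 0])
    print(f"{name:11s} t = {t:6.2f}  p = {p:.4f}")
# The adjusted metric typically yields a larger |t| (higher sensitivity), because the
# covariate explains much of the between-user variance unrelated to the treatment.
```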
In this work, we aim to close this gap. We formulate the problem of finding a sensitive metric combination as a data-driven machine learning problem and propose two intuitive optimization approaches to address it. Next, we perform an extensive experimental study of our proposed approaches. In our experiments, we use a dataset of 118 A/B tests performed by Yandex and study eight state-of-the-art ground-truth user engagement metrics, including Sessions per User and Absence Time. Our results suggest that considerable sensitivity improvements over the ground-truth metrics can be achieved by using our proposed approaches.
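To make the learning objective concrete, the sketch below illustrates one generic way to score a linear combination of per-user metrics across a set of A/B tests: require that its direction agrees with a ground-truth metric on experiments where that metric is itself significant, and maximize its average absolute t-statistic. This naive random search over non-negative weights is an assumption for illustration only; it is not either of the paper's two optimization approaches, and all data is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_users, n_metrics = 20, 2_000, 4
loadings = np.array([1.0, 0.8, 0.5, 0.3])   # how strongly each metric reflects the true effect
noise_sd = np.array([1.0, 0.6, 0.4, 0.3])   # per-metric noise level
gt_w = np.eye(n_metrics)[0]                 # ground-truth metric = metric 0 (e.g. a Sessions-like metric)

def t_stat(w, treatment, control):
    """Two-sample t-statistic of the linear combination w over one experiment."""
    return stats.ttest_ind(treatment @ w, control @ w).statistic

# Synthetic A/B tests: each has a latent treatment effect reflected in all metrics.
experiments = []
for _ in range(n_experiments):
    delta = rng.normal(0.0, 0.1)
    control = rng.normal(0.0, noise_sd, size=(n_users, n_metrics))
    treatment = rng.normal(delta * loadings, noise_sd, size=(n_users, n_metrics))
    experiments.append((treatment, control))

gt_t = np.array([t_stat(gt_w, tr, co) for tr, co in experiments])
confident = np.abs(gt_t) > 2.0              # experiments where the ground truth is significant

def score(w):
    """Average |t| of the combination; reject w if it contradicts confident ground-truth calls."""
    ts = np.array([t_stat(w, tr, co) for tr, co in experiments])
    if np.any(np.sign(ts[confident]) != np.sign(gt_t[confident])):
        return -np.inf
    return float(np.abs(ts).mean())

# Naive random search over non-negative weight vectors.
best_w = max((rng.random(n_metrics) for _ in range(500)), key=score)
print("weights:", np.round(best_w / best_w.sum(), 3))
print("avg |t|, combination:  %.2f" % score(best_w))
print("avg |t|, ground truth: %.2f" % np.abs(gt_t).mean())
```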

    Published In

    WSDM '17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining
    February 2017
    868 pages
    ISBN:9781450346757
    DOI:10.1145/3018661

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. a/b tests
    2. metric combination
    3. online controlled experiments
    4. online evaluation
    5. sensitivity improvement

    Qualifiers

    • Research-article

    Conference

    WSDM 2017

    Acceptance Rates

WSDM '17 Paper Acceptance Rate: 80 of 505 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%
