Research Article
DOI: 10.1145/3018661.3018708

Learning Sensitive Combinations of A/B Test Metrics

Published: 02 February 2017

Abstract

Online search evaluation, and A/B testing in particular, is an irreplaceable tool for modern search engines. Typically, online experiments last for several days or weeks and require a considerable portion of the search traffic. This restricts their usefulness and applicability.
To alleviate the need for large sample sizes in A/B experiments, several approaches have been proposed. Primarily, these approaches are based on increasing the sensitivity (informally, the ability to detect changes with fewer observations) of the evaluation metrics. Such sensitivity improvements are achieved by applying variance reduction methods, e.g. stratification and control covariates. However, the ability to learn sensitive metric combinations that (a) agree with the ground-truth metric, and (b) are more sensitive, has not been explored in the A/B testing scenario.
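As a point of reference for the variance-reduction methods mentioned above, the following is a minimal sketch of a control-covariate (CUPED-style) adjustment on synthetic data. It is not taken from the paper; the variable names, simulated effect size, and covariate model are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000                                   # users per arm (synthetic)

# Pre-experiment covariate and in-experiment metric for control + treatment users.
pre = rng.normal(10.0, 2.0, size=2 * n)
effect = np.r_[np.zeros(n), np.full(n, 0.05)]                  # small true treatment effect
metric = 0.8 * pre + rng.normal(0.0, 1.0, size=2 * n) + effect
group = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]   # 0 = control, 1 = treatment

# Control-covariate adjustment: theta = cov(metric, pre) / var(pre).
theta = np.cov(metric, pre)[0, 1] / np.var(pre, ddof=1)
adjusted = metric - theta * (pre - pre.mean())

for name, y in [("raw metric", metric), ("adjusted", adjusted)]:
    t, p = stats.ttest_ind(y[group == 1], y[group == 0])
    print(f"{name:11s} t = {t:6.2f}  p = {p:.4f}")
# The adjusted metric typically yields a larger |t| (higher sensitivity), because the
# covariate explains much of the between-user variance unrelated to the treatment.
```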
In this work, we aim to close this gap. We formulate the problem of finding a sensitive metric combination as a data-driven machine learning problem and propose two intuitive optimization approaches to address it. Next, we perform an extensive experimental study of our proposed approaches. In our experiments, we use a dataset of 118 A/B tests performed by Yandex and study eight state-of-the-art ground-truth user engagement metrics, including Sessions per User and Absence Time. Our results suggest that considerable sensitivity improvements over the ground-truth metrics can be achieved by using our proposed approaches.
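To make the learning objective concrete, the sketch below illustrates one generic way to score a linear combination of per-user metrics across a set of A/B tests: require that its direction agrees with a ground-truth metric on experiments where that metric is itself significant, and maximize its average absolute t-statistic. This naive random search over non-negative weights is an assumption for illustration only; it is not either of the paper's two optimization approaches, and all data is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_users, n_metrics = 20, 2_000, 4
loadings = np.array([1.0, 0.8, 0.5, 0.3])   # how strongly each metric reflects the true effect
noise_sd = np.array([1.0, 0.6, 0.4, 0.3])   # per-metric noise level
gt_w = np.eye(n_metrics)[0]                 # ground-truth metric = metric 0 (e.g. a Sessions-like metric)

def t_stat(w, treatment, control):
    """Two-sample t-statistic of the linear combination w over one experiment."""
    return stats.ttest_ind(treatment @ w, control @ w).statistic

# Synthetic A/B tests: each has a latent treatment effect reflected in all metrics.
experiments = []
for _ in range(n_experiments):
    delta = rng.normal(0.0, 0.1)
    control = rng.normal(0.0, noise_sd, size=(n_users, n_metrics))
    treatment = rng.normal(delta * loadings, noise_sd, size=(n_users, n_metrics))
    experiments.append((treatment, control))

gt_t = np.array([t_stat(gt_w, tr, co) for tr, co in experiments])
confident = np.abs(gt_t) > 2.0              # experiments where the ground truth is significant

def score(w):
    """Average |t| of the combination; reject w if it contradicts confident ground-truth calls."""
    ts = np.array([t_stat(w, tr, co) for tr, co in experiments])
    if np.any(np.sign(ts[confident]) != np.sign(gt_t[confident])):
        return -np.inf
    return float(np.abs(ts).mean())

# Naive random search over non-negative weight vectors.
best_w = max((rng.random(n_metrics) for _ in range(500)), key=score)
print("weights:", np.round(best_w / best_w.sum(), 3))
print("avg |t|, combination:  %.2f" % score(best_w))
print("avg |t|, ground truth: %.2f" % np.abs(gt_t).mean())
```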

    Published In

    WSDM '17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining
    February 2017
    868 pages
    ISBN:9781450346757
    DOI:10.1145/3018661

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. a/b tests
    2. metric combination
    3. online controlled experiments
    4. online evaluation
    5. sensitivity improvement

    Qualifiers

    • Research-article

    Conference

    WSDM 2017

    Acceptance Rates

WSDM '17 Paper Acceptance Rate: 80 of 505 submissions, 16%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%
