Powerful A/B-Testing Metrics and Where to Find Them

Published: 08 October 2024 · DOI: 10.1145/3640457.3688036

Abstract

Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome?
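To make the testing machinery concrete, here is a minimal sketch of the standard decision procedure: a two-sample z-test on per-user values of a single metric. Everything below (the function, the Poisson-distributed engagement counts, the sample sizes) is an illustrative assumption, not the authors' setup.

    import numpy as np
    from scipy import stats

    def ab_test_z(control: np.ndarray, treatment: np.ndarray, alpha: float = 0.05):
        """Two-sample z-test on per-user metric values from an A/B-test."""
        diff = treatment.mean() - control.mean()
        # Standard error of the difference in means (independent samples).
        se = np.sqrt(treatment.var(ddof=1) / len(treatment)
                     + control.var(ddof=1) / len(control))
        z = diff / se
        p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
        return z, p, p < alpha

    # Hypothetical per-user engagement counts under each variant.
    rng = np.random.default_rng(0)
    control = rng.poisson(lam=5.0, size=100_000)
    treatment = rng.poisson(lam=5.02, size=100_000)
    z, p, significant = ab_test_z(control, treatment)
    print(f"z = {z:.2f}, p = {p:.3f}, significant: {significant}")

With a small true effect and finite samples, a test like this will often come back insignificant, which is exactly the type-II failure mode that motivates looking for more powerful supporting metrics.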
The question then becomes: how do we assess a supporting metric’s utility for decision-making in A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics’ utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. z-scores and p-values). We present results and insights from building this pipeline at scale for two large short-video platforms, ShareChat and Moj, leveraging hundreds of past experiments to find online metrics with high statistical power.
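As a sketch of what such a pipeline computes, the snippet below estimates type-I, type-II, and type-III error rates for one candidate metric over a corpus of past experiments. The ExperimentResult record and its true_direction label (the ground-truth effect direction for each experiment, e.g. derived from a long-running North Star measurement) are hypothetical stand-ins; the paper's actual pipeline may differ.

    import numpy as np
    from dataclasses import dataclass
    from scipy import stats

    @dataclass
    class ExperimentResult:
        # Hypothetical record of one candidate metric in one past experiment.
        z_score: float        # the metric's observed z-score in that experiment
        true_direction: int   # ground-truth effect direction: +1, -1, or 0 (none)

    def error_rates(results: list[ExperimentResult], alpha: float = 0.05):
        """Estimate type-I, type-II, and type-III error rates across experiments."""
        z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
        type1 = type2 = type3 = 0
        n_null = n_effect = 0
        for r in results:
            significant = abs(r.z_score) > z_crit
            detected = int(np.sign(r.z_score)) if significant else 0
            if r.true_direction == 0:
                n_null += 1
                type1 += significant       # flagged an effect where none exists
            else:
                n_effect += 1
                if not significant:
                    type2 += 1             # missed a real effect
                elif detected != r.true_direction:
                    type3 += 1             # significant, but in the wrong direction
        return (type1 / max(n_null, 1),
                type2 / max(n_effect, 1),
                type3 / max(n_effect, 1))

Under this framing, a good supporting metric is one whose type-II rate is low (high power) at a controlled type-I rate, so that it moves significantly, and in the right direction, when the North Star metric does not.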

Published In

RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems
October 2024
1438 pages

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Extended-abstract
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate: 254 of 1,295 submissions (20%)
