Powerful A/B-Testing Metrics and Where to Find Them

Published: 08 October 2024 · DOI: 10.1145/3640457.3688036

Abstract

Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome?
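To make the testing machinery concrete, here is a minimal sketch of the standard decision procedure: a two-sample z-test on per-user values of a single metric. Everything below (the function, the Poisson-distributed engagement counts, the sample sizes) is an illustrative assumption, not the authors' setup.

    import numpy as np
    from scipy import stats

    def ab_test_z(control: np.ndarray, treatment: np.ndarray, alpha: float = 0.05):
        """Two-sample z-test on per-user metric values from an A/B-test."""
        diff = treatment.mean() - control.mean()
        # Standard error of the difference in means (independent samples).
        se = np.sqrt(treatment.var(ddof=1) / len(treatment)
                     + control.var(ddof=1) / len(control))
        z = diff / se
        p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
        return z, p, p < alpha

    # Hypothetical per-user engagement counts under each variant.
    rng = np.random.default_rng(0)
    control = rng.poisson(lam=5.0, size=100_000)
    treatment = rng.poisson(lam=5.02, size=100_000)
    z, p, significant = ab_test_z(control, treatment)
    print(f"z = {z:.2f}, p = {p:.3f}, significant: {significant}")

With a small true effect and finite samples, a test like this will often come back insignificant, which is exactly the type-II failure mode that motivates looking for more powerful supporting metrics.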
The question then becomes: how do we assess a supporting metric’s utility for decision-making in A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics’ utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. z-scores and p-values). We present results and insights from building this pipeline at scale for two large short-video platforms, ShareChat and Moj, leveraging hundreds of past experiments to find online metrics with high statistical power.
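As a sketch of what such a pipeline computes, the snippet below estimates type-I, type-II, and type-III error rates for one candidate metric over a corpus of past experiments. The ExperimentResult record and its true_direction label (the ground-truth effect direction for each experiment, e.g. derived from a long-running North Star measurement) are hypothetical stand-ins; the paper's actual pipeline may differ.

    import numpy as np
    from dataclasses import dataclass
    from scipy import stats

    @dataclass
    class ExperimentResult:
        # Hypothetical record of one candidate metric in one past experiment.
        z_score: float        # the metric's observed z-score in that experiment
        true_direction: int   # ground-truth effect direction: +1, -1, or 0 (none)

    def error_rates(results: list[ExperimentResult], alpha: float = 0.05):
        """Estimate type-I, type-II, and type-III error rates across experiments."""
        z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
        type1 = type2 = type3 = 0
        n_null = n_effect = 0
        for r in results:
            significant = abs(r.z_score) > z_crit
            detected = int(np.sign(r.z_score)) if significant else 0
            if r.true_direction == 0:
                n_null += 1
                type1 += significant       # flagged an effect where none exists
            else:
                n_effect += 1
                if not significant:
                    type2 += 1             # missed a real effect
                elif detected != r.true_direction:
                    type3 += 1             # significant, but in the wrong direction
        return (type1 / max(n_null, 1),
                type2 / max(n_effect, 1),
                type3 / max(n_effect, 1))

Under this framing, a good supporting metric is one whose type-II rate is low (high power) at a controlled type-I rate, so that it moves significantly, and in the right direction, when the North Star metric does not.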

Published In

RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems
October 2024
1438 pages

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Extended-abstract
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate: 254 of 1,295 submissions (20%)
