ABSTRACT
Controlled experimentation, also called A/B testing, is widely adopted to accelerate product innovation in the online world. However, how fast we innovate can be limited by how we run experiments. Most experiments go through a "ramp-up" process in which traffic to the new treatment is gradually increased to 100%. We have seen substantial inefficiency and risk in how experiments are ramped, and it gets in the way of innovation. This can go both ways: ramp too slowly, and time and resources are wasted; ramp too fast, and suboptimal decisions are made. In this paper, we build a ramping framework that effectively balances Speed, Quality and Risk (SQR). We start by identifying the most common mistakes experimenters make, and then introduce the four SQR principles corresponding to the four ramp phases of an experiment. To truly scale SQR to all experiments, we develop a statistical algorithm that is embedded into the process of running every experiment and automatically recommends ramp decisions. Finally, to complete the picture, we briefly cover the auto-ramp engineering infrastructure that collects inputs and executes the recommendations in a timely and reliable manner.
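To make the ramp-decision idea concrete, below is a minimal Python sketch of an automatic ramp-recommendation rule. It is an illustrative assumption, not the paper's actual SQR algorithm: the function name recommend_ramp, the harm_threshold parameter, and the fixed ramp schedule are all hypothetical. It encodes the three trade-offs in the simplest way: a risk gate (shut down if the key metric is confidently worse than a tolerable harm level), a quality gate (hold until the effect is measured precisely enough to call), and speed (advance to the next ramp step as soon as the test is conclusive, rather than on a fixed timetable).

```python
"""Minimal sketch of an auto-ramp recommendation rule (hypothetical;
not the paper's exact SQR algorithm). Uses a normal approximation for
the observed lift of the key metric."""
import math


def _norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def recommend_ramp(delta: float, se: float, current_pct: float,
                   alpha: float = 0.05, harm_threshold: float = -0.01,
                   ramp_schedule=(1, 5, 10, 25, 50, 100)) -> str:
    """Recommend the next ramp action for one experiment phase.

    delta: observed treatment-minus-control lift of the key metric
    se:    standard error of delta at the current traffic allocation
    """
    z = delta / se
    p_two_sided = 2.0 * (1.0 - _norm_cdf(abs(z)))

    # Risk gate: shut down if the probability that the true lift exceeds
    # the maximum tolerable harm is below alpha.
    if _norm_cdf((delta - harm_threshold) / se) < alpha:
        return "shut down"

    # Quality gate: hold at the current allocation until the effect is
    # statistically distinguishable from zero.
    if p_two_sided > alpha:
        return f"hold at {current_pct}% and collect more data"

    # Speed: the test is conclusive, so move to the next scheduled step.
    next_steps = [p for p in ramp_schedule if p > current_pct]
    return f"ramp to {next_steps[0]}%" if next_steps else "ramp complete (100%)"


# Example: a clearly positive lift measured at a 5% ramp.
print(recommend_ramp(delta=0.02, se=0.005, current_pct=5))  # -> ramp to 10%
```

In practice, the per-phase thresholds would differ (e.g., the early phases emphasize the risk gate, later phases the quality gate), and the decision would be recomputed continuously by the auto-ramp infrastructure rather than at fixed checkpoints; this sketch only illustrates the shape of such a rule.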