Multi-armed bandits in the wild: Pitfalls and strategies in online experiments
Introduction
Delivering faster value to customers with online experimentation is an emerging practice in industry [1], [2], [3]. Web-facing software companies (such as Microsoft, Google, Netflix, Booking.com, Yelp, and Amazon, among others) often report on success cases and the competitive advantage of using post-deployment data together with online controlled experiments as an integral part of their development methodologies [2], [4], [5], [6], [7], [8], [9], [10], [11]. This competitive advantage leads companies to experiment with almost every change made to their systems, from developing new functionality to fine-tuning existing features, and this intensive use is leading companies to deploy thousands of experiments every year [11], [12], [13]. A famous example is the ‘50 shades of blue’ experiment at Google, in which engineers tested different shades of blue for hyperlinks on Google's search page; the best-performing shade resulted in an additional 200 million dollars in revenue [14], [15].
To support the diversity and the scale of experiments, software companies and academic researchers are developing innovative solutions for automating experiments, scaling the experimentation infrastructure, and creating new algorithms and experimental designs [6], [11], [12], [16], [17], [18]. One emerging class of algorithms, known as Multi-Armed Bandits (MAB) [19], [20], is being widely explored in the context of online experiments, as it has the potential to deliver faster results with better allocation of resources [16] compared to traditional experiments, such as A/B testing. However, incorrect use of MAB-based experiments can lead to misinterpretations and wrong conclusions that can potentially hurt the company's business.
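To make the contrast with A/B testing concrete, the sketch below (our illustration, not code from the paper; the conversion rates and traffic volume are assumed) shows Beta-Bernoulli Thompson sampling, a widely used MAB algorithm. Instead of a fixed 50/50 split, traffic gradually shifts toward the variant most likely to be the best:

```python
import random

def thompson_sampling(true_rates, n_users=10000, seed=0):
    """Beta-Bernoulli Thompson sampling over a list of variants.

    `true_rates` are the (in practice unknown) conversion rates of each
    variant; here they drive a Bernoulli reward simulation.
    """
    rng = random.Random(seed)
    alpha = [1] * len(true_rates)  # Beta(1, 1) uniform priors
    beta = [1] * len(true_rates)
    counts = [0] * len(true_rates)  # users allocated per variant
    for _ in range(n_users):
        # Sample a plausible rate for each arm from its posterior,
        # then serve the arm whose sample is highest.
        samples = [rng.betavariate(alpha[a], beta[a])
                   for a in range(len(true_rates))]
        arm = samples.index(max(samples))
        counts[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated conversion
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return counts

counts = thompson_sampling([0.04, 0.05])
```

Because allocation adapts during the experiment, fewer users are exposed to the weaker variant than under a fixed split, which is precisely the resource-allocation advantage, and also the source of the statistical pitfalls, discussed in this paper.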
To the best of the authors’ knowledge, there is no work that discusses the limitations of MAB-based experiments. This work attempts to address this gap from an industry perspective by combining a multiple-case study with simulations. The study analyzes some of the limitations faced by companies using MAB algorithms and discusses strategies used to overcome them. The results are summarized into practitioners’ guidelines with criteria for selecting an appropriate experimental design.
The remainder of the paper is organized as follows. Section 2 provides background on the MAB problem and its algorithms, on controlled experiments and A/B testing, and on experimentation processes. Section 3 discusses the research method and threats to validity. Section 4 presents and discusses the restrictions associated with MAB implementations for online experiments. Section 5 discusses the results, the use cases where MAB algorithms are desirable, and a guideline process for choosing between traditional experimentation techniques, such as A/B experiments, and MABs. Section 6 concludes and discusses related research challenges.
Background
In this section, we consider the different aspects of running online experiments. We describe a traditional online experiment in the form of an A/B test and discuss some of the limitations of this method. Next, we present the MAB class of problems and discuss some of the advantages of MAB. In the appendix, we present the MAB algorithms used in the simulations.
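As a point of reference for the traditional A/B baseline described here, the following is a minimal sketch of the usual analysis step, assuming a two-proportion z-test on conversion counts (the concrete test and the numbers are illustrative, not prescribed by this paper):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF (expressed with erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 10,000 users per variant, 480 vs 560 conversions.
z, p = two_proportion_ztest(conv_a=480, n_a=10000, conv_b=560, n_b=10000)
```

Note that this analysis assumes a fixed, predetermined sample size per variant; as discussed later, adaptive MAB allocation violates that assumption, which is one root of the pitfalls examined in this paper.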
Research method
In earlier discussions with practitioners, we identified that, although academic research suggests that MAB algorithms provide several benefits, companies were not using MAB extensively in practice. Some of these companies suggested that these algorithms did not provide the expected benefits and that they even showed several limitations. Based on these observations, we designed this study to identify the restrictions and pitfalls of MAB-based experiments from the point of view of industry practitioners.
Results
This section discusses the results obtained from the empirical data collected in the interviews and from the simulations.
Discussion
Feature experiments, powered by MABs, can provide a competitive edge for organizations, but only when skillfully applied. Several potential pitfalls can hinder the benefits of using MABs. For example, popular experimental models, such as HYPEX [22] or RIGHT [23], may not be well aligned with MAB-based experiments. In particular, these models often assume that the experimental process should minimize type I errors (false positives) rather than minimize regret (the opportunity cost of not selecting the best variant).
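The regret framing can be illustrated with a small simulation (our sketch, not the paper's code; the rates, epsilon, and traffic volume are assumed) comparing the cumulative regret of a fixed uniform split against an epsilon-greedy bandit:

```python
import random

def cumulative_regret(policy, true_rates, n_users, seed=1):
    """Sum, over all users, of the gap between the best arm's true rate
    and the chosen arm's true rate (the opportunity cost of the policy)."""
    rng = random.Random(seed)
    best = max(true_rates)
    successes = [0] * len(true_rates)
    counts = [0] * len(true_rates)
    regret = 0.0
    for _ in range(n_users):
        arm = policy(rng, successes, counts)
        regret += best - true_rates[arm]
        counts[arm] += 1
        successes[arm] += rng.random() < true_rates[arm]
    return regret

def ab_split(rng, successes, counts):
    # Classic A/B test: uniform random allocation for the whole run.
    return rng.randrange(len(counts))

def epsilon_greedy(rng, successes, counts, eps=0.1):
    # Simple bandit: explore 10% of the time, otherwise exploit the
    # arm with the best observed conversion rate.
    if 0 in counts or rng.random() < eps:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: successes[a] / counts[a])

rates = [0.04, 0.06]
ab_regret = cumulative_regret(ab_split, rates, 20000)
mab_regret = cumulative_regret(epsilon_greedy, rates, 20000)
```

Under a uniform split, roughly half of the users are sent to the weaker variant for the entire experiment, so regret grows linearly with traffic; the bandit typically accumulates far less, which is exactly the objective that models centered on type I error control do not capture.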
Conclusion
Delivering faster value to customers with online experiments is an emerging practice in industry. MAB algorithms have the potential to deliver even faster results with a better allocation of resources than traditional A/B experiments. This work describes common models, paradigms, and algorithms for MAB-based feature experiments currently used in industry. Based on a study with 11 experts across 5 companies, we identified potential mistakes that can occur when designing a feature experiment and strategies to mitigate them.
Acknowledgments
This work was partially supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The authors also thank the companies and the interviewees involved for the opportunity to conduct this study with them. Finally, the authors gratefully acknowledge the anonymous reviewers, whose comments significantly improved this paper.
References
- Building blocks for continuous experimentation
- A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments
- A stochastic bandit algorithm for scratch games, J. Mach. Learn. Res. Track (2012)
- Building products as innovation experiment systems, Lect. Notes Bus. Inf. Process. (2012)
- Controlled experiments on the web: survey and practical guide, Data Min. Knowl. Discov. (Feb. 2009)
- Time to say ‘good bye’: feature lifecycle
- The benefits of controlled experimentation at scale
- Designing and deploying online field experiments
- Overlapping experiment infrastructure
- L. Li, W. Chu, J. Langford, R.E. Schapire, A contextual-bandit approach to personalized news article recommendation, ...
- Network A/B testing
- Practical lessons from predicting clicks on ads at Facebook
- The evolution of continuous experimentation in software product development
- Characterizing experimentation in continuous deployment: a case study on Bing
- Online controlled experiments at large scale
- Why Google has 200m reasons to put engineers over designers, The Guardian
- Counterfactual reasoning and learning systems, J. Mach. Learn. Res.
- Measuring metrics
- Multi-armed bandit experiments in the online service economy, Appl. Stoch. Model. Bus. Ind.
- R. Sutton, A. Barto, Reinforcement Learning: An Introduction
- Design and Analysis of Experiments
- The HYPEX model: from opinions to data-driven software development, Contin. Softw. Eng.
- The RIGHT model for continuous experimentation, J. Syst. Softw.