Multi-Armed Bandits in Recommendation Systems: A survey of the state-of-the-art and future directions

https://doi.org/10.1016/j.eswa.2022.116669

Highlights

  • A literature review of studies about MAB in recommender systems from 2000 to 2020.

  • A discussion about MAB algorithms, datasets, and evaluation metrics.

  • An updated panorama of the current practices and models applied in MAB research.

  • Discussion on the applicability of MAB models in the main recommendation challenges.

  • Future directions to be explored by using MAB in the recommendation field.

Abstract

Recommender Systems (RSs) have assumed a crucial role in several digital companies by directly affecting their key performance indicators. Nowadays, in this era of big data, the information available about users and items is continually updated, and the application of traditional batch learning paradigms has become more restricted. Hence, current efforts in the recommendation field have focused on this online environment and modeled their systems as a Multi-Armed Bandit (MAB) problem. Nevertheless, there is no consensus on the best practices to design, perform, and evaluate MAB implementations in the recommendation field. Thus, this work performs a systematic literature review (SLR) to shed light on this new topic. By inspecting 1327 articles published over the last twenty years (2000–2020), this work: (1) consolidates an updated picture of the main research conducted in this area so far; (2) highlights the most used concepts and methods, their core characteristics, and main limitations; and (3) evaluates the applicability of MAB-based recommendation approaches to some traditional RS challenges, such as data sparsity, scalability, cold-start, and explainability. These discussions and analyses also allow us to identify several gaps in the current literature, providing a strong guideline for future research.

Introduction

In the last three decades, the exponential growth of digital information on the Web has left users overwhelmed, not knowing what to buy, listen to, or watch. This problem is known in the literature as information overload, and it has led several researchers to work on Recommender Systems (RSs), which suggest items (e.g., movies, books, songs) to mitigate this problem (Shapira, Ricci, Kantor, & Rokach, 2011). Formally, RSs aim to estimate the user’s preference, or even a specific rating, for the available items in order to provide recommendations that increase both the user’s satisfaction and the system’s profit (Pathak, Garfinkel, Gopal, Venkatesan, & Yin, 2010). Distinct algorithms have been proposed so far based on the main recommendation strategies, such as Collaborative Filtering (CF), Demographic Filtering (DF), Content-based (CB), and Knowledge-based (KB) (Bobadilla et al., 2013, Jannach et al., 2010, Park et al., 2012).

Current efforts have proposed to handle the online recommendation task with concepts from the Reinforcement Learning (RL) field by modeling it as a Multi-Armed Bandit (MAB) problem (Wang et al., 2016, Wang, Wu et al., 2017, Wu et al., 2016, Zhao et al., 2013). Traditionally, MAB is defined as a sequential decision model that must repeatedly choose an action $a$ from a set of actions $\mathcal{A}$ – a.k.a. arms. Selecting an action $a_t \in \mathcal{A}$ at trial $t$ yields a certain reward $R_t(a_t) \in \mathbb{R}$, which can be summarized as a real number. The main goal is to maximize the cumulative reward $\sum_{t=1}^{T} R_t(a_t)$ over $T$ trials. In the recommendation domain, the available items are usually modeled as the arms to be pulled. Selecting an arm is equivalent to recommending an item, and the reward is the user’s response (e.g., clicks, acceptance, satisfaction). Similar to traditional RL scenarios, to achieve its goal, the bandit model should balance the exploitation and exploration dilemma. While exploitation means pulling the arms with the highest past rewards, maximizing the system’s short-term reward, exploration recommends other arms to improve the knowledge available about users and items, maximizing the system’s long-term reward (Sanz-Cruzado et al., 2019, Zhao et al., 2013).
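
To make this loop concrete, the sketch below simulates an ε-greedy policy, one of the simplest ways to balance exploitation and exploration. It is a minimal illustration only: the class name and the per-item click probabilities are hypothetical, not taken from any surveyed system.

```python
import random

# Minimal epsilon-greedy bandit: items are arms, a pull is a recommendation,
# and the reward is a simulated binary click signal.
class EpsilonGreedyRecommender:
    def __init__(self, n_items, epsilon=0.1):
        self.epsilon = epsilon
        self.pulls = [0] * n_items      # times each item was recommended
        self.rewards = [0.0] * n_items  # cumulative reward per item

    def select_item(self):
        # Exploration: with probability epsilon, recommend a random item.
        if random.random() < self.epsilon:
            return random.randrange(len(self.pulls))
        # Exploitation: otherwise recommend the item with the best mean reward.
        means = [r / p if p > 0 else 0.0
                 for r, p in zip(self.rewards, self.pulls)]
        return max(range(len(means)), key=means.__getitem__)

    def update(self, item, reward):
        self.pulls[item] += 1
        self.rewards[item] += reward

# Simulated interaction loop with hypothetical click probabilities per item.
true_ctr = [0.05, 0.12, 0.08]
bandit = EpsilonGreedyRecommender(n_items=3)
for t in range(10000):
    item = bandit.select_item()
    click = 1 if random.random() < true_ctr[item] else 0
    bandit.update(item, click)
print(bandit.pulls)  # pulls should concentrate on item 1, the best arm
```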

The MAB problem has attracted a lot of attention from both industry and academia in the recommendation field. In academia, more than 50% of all publications on this topic were proposed only in the last five years, as shown in Fig. 1. Similarly, in industry, recent talks by research leaders from Netflix, Pandora, and Spotify at major conferences, such as ACM Recommender Systems (RecSys), the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), and The Web Conference (WWW), have revealed companies’ growing interest in this topic, especially for handling the online recommendation task.

However, even with this growing number of publications, no work has yet mapped the main advances in the area, explained the main concepts, and clarified the best practices. Therefore, this work performs a systematic literature review (SLR) on MAB in the recommendation field to achieve three main goals:

  • (1) Provide a summary overview of the most important research in this area;

  • (2) Highlight the most popular concepts and methods, their core characteristics, and their main limitations, providing future directions to guide the next research questions;

  • (3) Discuss the applicability of MAB-based recommendation approaches to some traditional RS challenges, such as data sparsity, scalability, cold-start, privacy, and explainability.

Searching all conferences available in Google Scholar from 2000 to 2020, our SLR identified 1327 articles based on three main search strings designed according to our goals and research questions. Then, we conducted two main reading steps to filter and identify the most relevant studies. While the first step performs a quick screening of the title, year, venue, and abstract, the second performs a more complete analysis by reading the introduction, experimentation, and conclusion of each work. In the first reading, only 408 papers (30.75%) were selected for the second step. Then, in the second stage, another 178 papers were rejected and only 230 papers were selected as relevant studies about MAB in the recommendation field.1 These works were studied in depth to achieve our three main goals. All of them were read to build our knowledge, but only those with an experimental setup were used to fill a data extraction form designed to capture the current practices in the literature.

In general, our SLR provides distinct contributions for academia and industry. In this paper, we highlight: (1) the main conferences where MAB studies have been published; (2) the most common scenarios simulated by the publications; (3) the datasets applied in these studies; (4) the main algorithms usually applied to address the MAB problem; and (5) the main advances obtained by combining traditional bandit algorithms with concepts from recommender systems. Our work also identifies several gaps in the current literature and proposes relevant future directions. For instance, we noticed the absence of strict evaluation criteria that reflect the traditional RS goals: relevant metrics usually related to user satisfaction or engagement have been neglected in favor of the traditional evaluation criteria of learning algorithms based on rewards (or regrets). Moreover, it is not clear how the most relevant challenges of the recommendation field affect bandit algorithms. We discuss problems that remain open in the field, such as sparsity, scalability, cold-start, privacy, and explainability, pointing out future directions for research on these topics.
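
As a concrete illustration of the reward/regret criterion mentioned above, the snippet below computes the expected cumulative regret of a run, i.e., the reward lost relative to always playing the best arm. The arm means and the log of chosen arms are made-up values for the example.

```python
# Hypothetical expected per-arm rewards and a log of arms chosen by a policy.
arm_means = [0.05, 0.12, 0.08]
chosen_arms = [0, 1, 2, 1, 1, 0, 1]

# Cumulative (pseudo-)regret: sum over trials of (mu_star - mu_{a_t}).
mu_star = max(arm_means)
regret = sum(mu_star - arm_means[a] for a in chosen_arms)
print(round(regret, 2))  # 0.18 for this log
```

Note that regret alone says nothing about diversity, novelty, or user engagement, which is precisely the evaluation gap discussed above.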

The remainder of this paper is organized as follows. First, Section 2 highlights background concepts about the traditional MAB problem. Then, Section 3 presents the SLR process, detailing each step of our inspection. Section 4 organizes the main discussion of our paper by highlighting the works developed so far, the current evaluation criteria, and how MAB has faced the current challenges of the field. These discussions allow us to point out the main future directions in Section 5. Finally, Section 6 presents our main conclusions.

Section snippets

Background concepts

The Multi-Armed Bandit (MAB) problem, sometimes called the K-armed bandit problem (Zhao, Xia, Tang and Yin, 2019), is a classic problem in which a fixed, limited set of resources (arms) must be allocated among competing choices to maximize the expected gain (reward). The name ‘bandit’ comes from imagining a gambler at a row of slot machines in a casino, who tries to maximize the sum of rewards earned through a sequence of lever pulls. Basically, at each trial, the
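
To illustrate this trial-by-trial selection, the following is a minimal sketch of the classic UCB1 policy for the K-armed setting, in which each arm’s mean estimate is inflated by a confidence bonus derived from Hoeffding’s inequality. The function name and its arguments are our own illustrative choices.

```python
import math

def ucb1_select(pulls, rewards, t):
    """Pick an arm by the UCB1 rule: mean estimate plus a confidence bonus.

    pulls[i]   -- how many times arm i was pulled so far
    rewards[i] -- cumulative reward collected from arm i
    t          -- current trial index (t >= 1)
    """
    # Initialization: pull every arm once before trusting the bound.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm

    def upper_bound(arm):
        mean = rewards[arm] / pulls[arm]                 # exploitation term
        bonus = math.sqrt(2 * math.log(t) / pulls[arm])  # exploration term
        return mean + bonus

    return max(range(len(pulls)), key=upper_bound)
```

Arms with few pulls receive a large bonus and are tried again, while well-explored arms are judged mostly by their empirical mean.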

The systematic literature review protocol

A systematic literature review (SLR) is a scientific methodology designed to answer well-formulated research questions. It aims to identify and synthesize all of the scholarly research on a particular topic by applying a rigorous, unbiased, and reproducible protocol. In general, a standard protocol is defined by several high-level steps so that the type of research question does not influence the review procedures. Here, we design a protocol inspired by Çano and

Multi-Armed Bandits in the recommendation field

Nowadays, several works have modeled the online recommendation task as a Multi-Armed Bandit problem (Felício et al., 2017, Wang, Wang et al., 2017, Wang, Zeng et al., 2018). In most of the bandit representations, the items to be recommended are modeled as the arms to be pulled. Selecting an arm a is equivalent to recommending an item i and the reward is the user response to this recommendation (e.g., clicks, ratings, acceptance, etc.) (Sanz-Cruzado et al., 2019). Thus, the main goal is also to
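
One policy that recurs among the surveyed works for exactly this setting is Thompson sampling with a Beta prior over each item’s click rate. The sketch below is a minimal illustration under that assumption; the class and method names are ours, and the binary click feedback is simulated.

```python
import random

# Thompson sampling for item recommendation with binary click feedback,
# assuming a Beta(1, 1) prior per item.
class ThompsonSamplingRecommender:
    def __init__(self, n_items):
        self.alpha = [1] * n_items  # 1 + observed clicks per item
        self.beta = [1] * n_items   # 1 + observed skips per item

    def recommend(self):
        # Sample a plausible click rate per item and pick the best sample;
        # uncertain items occasionally win, which drives exploration.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, item, clicked):
        if clicked:
            self.alpha[item] += 1
        else:
            self.beta[item] += 1
```

Each observed click or skip tightens the posterior of the recommended item, so exploration naturally decays as evidence accumulates.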

Future directions and research opportunities

As aforementioned, the application of MAB in the recommendation field is very recent, and there are still many research opportunities and improvements available for future work. In this section, we go beyond discussing what has been done and propose future directions to be addressed by future research. First, we highlight the main approaches that should be further studied or improved according to the answers to our first research question. Then, we open a new discussion

Conclusion

In this work we have presented a systematic literature review of Multi-Armed Bandits in the recommendation field to shed light upon their applicability and open challenges. By inspecting 1327 articles published over the last twenty years (2000–2020), we identified 230 works as the most relevant studies about MAB in the field. These articles were read in detail and analyzed to fill a specific data extraction form. This form guided this work toward three main goals: (1) it consolidates an

CRediT authorship contribution statement

Nícollas Silva: Conceptualization, Methodology, Papers reading, Filling the data extract form, Validation, Formal analysis, Writing – review & editing. Heitor Werneck: Papers reading, Filling the data extract form, Validation, Plot graphics, Write tables, Writing – review & editing. Thiago Silva: Papers search, Papers reading, Filling the data extract form, Validation, Writing – review & editing. Adriano C.M. Pereira: Supervision, Methodology, Validation, Formal analysis, Writing – review &

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by CNPq, Brazil, CAPES, Brazil, FINEP, Brazil, Fapemig, Brazil, and INWEB, Brazil.

References (199)

  • Barraza-Urbina, A., et al. BEARS: Towards an evaluation framework for bandit-based interactive recommender systems.
  • Basu, S., et al. Blocking bandits.
  • Bernardi, L., et al. Recommending accommodation filters with online learning.
  • Bobadilla, J., et al. Recommender systems survey. Knowledge-Based Systems (2013).
  • Bostandjiev, S., O’Donovan, J., & Höllerer, T. (2012). TasteWeights: A visual interactive hybrid recommender system. In...
  • Bouneffouf, D. Freshness-aware Thompson sampling.
  • Bouneffouf, D. Contextual bandit algorithm for risk-aware recommender systems.
  • Bouneffouf, D., et al. A contextual-bandit algorithm for mobile context-aware recommender system.
  • Bouneffouf, D., et al. Hybrid-ɛ-greedy for mobile context-aware recommender system.
  • Bouneffouf, D., et al. Following the user’s interests in mobile context-aware recommender systems: The hybrid-e-greedy algorithm (2012).
  • Bouneffouf, D., et al. Learning exploration for contextual bandit.
  • Bouneffouf, D., et al. Contextual bandit for active learning: Active Thompson sampling.
  • Bouneffouf, D., et al. Context attentive bandits: Contextual bandit with restricted context (2017).
  • Bresler, G., et al. A latent source model for online collaborative filtering. Advances in Neural Information Processing Systems (2014).
  • Brodén, B., et al. Ensemble recommendations via Thompson sampling: An experimental study within e-commerce.
  • Brodén, B., et al. A bandit-based ensemble framework for exploration/exploitation of diverse recommendation components: An experimental study within e-commerce. ACM Transactions on Interactive Intelligent Systems (TiiS) (2019).
  • Cañamares, R., et al. Multi-armed recommender system bandit ensembles.
  • Çano, E., et al. Hybrid recommender systems: A systematic literature review. Intelligent Data Analysis (2017).
  • Cao, Y., et al. Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit.
  • Caron, S., et al. Mixing bandits: A recipe for improved cold-start recommendations in a social network.
  • Castells, P., et al. Novelty and diversity in recommender systems.
  • Castells, P., Vargas, S., & Wang, J. (2011). Novelty and diversity metrics for recommender systems: Choice, discovery...
  • Celis, L. E., Kapoor, S., Salehi, F., & Vishnoi, N. (2019). Controlling polarization in personalization: An algorithmic...
  • Cesa-Bianchi, N., et al. A gang of bandits.
  • Chapelle, O., et al. An empirical evaluation of Thompson sampling.
  • Chatterji, N., et al. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits.
  • Chen, L., et al. Interactive submodular bandit.
  • Chen, M., et al. Performance evaluation of recommender systems. International Journal of Performability Engineering (2017).
  • Chen, L., et al. Contextual combinatorial multi-armed bandits with volatile arms and submodular reward.
  • Chi, C. M., et al. Online clustering of bandits with high-dimensional sparse relevant user features.
  • Christakopoulou, K., et al. Learning to interact with users: A collaborative-bandit approach.
  • Christakopoulou, K., Radlinski, F., & Hofmann, K. (2016). Towards conversational recommender systems. In Proceedings of...
  • Crammer, K., et al. Multiclass classification with bandit feedback using adaptive regularization. Machine Learning (2011).
  • Cremonesi, P., et al. Looking for “good” recommendations: A comparative evaluation of recommender systems.
  • Duchi, J. CS229 supplemental lecture notes: Hoeffding’s inequality (2017).
  • Dumitrascu, B., et al. PG-TS: Improved Thompson sampling for logistic contextual bandits. Advances in Neural Information Processing Systems (2018).
  • Edwards, J., et al. Selecting multiple web adverts: A contextual multi-armed bandit with state uncertainty. Journal of the Operational Research Society (2020).
  • Eide, S., & Zhou, N. (2018). Deep neural network marketplace recommenders in online experiments. In Proceedings of the...
  • Felício, C., Paixão, K., Barcelos, C., & Preux, P. (2017). A multi-armed bandit model selection for cold-start user...
  • Friedrich, G., et al. A taxonomy for generating explanations in recommender systems. AI Magazine (2011).