Interactive Thompson Sampling for Multi-objective Multi-armed Bandits

  • Conference paper
Algorithmic Decision Theory (ADT 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10576)


Abstract

In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown utility functions of users, based on the stochastic reward vectors only. In online MORL on the other hand, the agent will often be able to elicit preferences from the user, enabling it to learn about the utility function of its user directly. In this paper, we study online MORL with user interaction employing the multi-objective multi-armed bandit (MOMAB) setting — perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms: Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that the performance of these algorithms in terms of regret closely approximates the regret of UCB and regular Thompson sampling provided with the ground truth utility function of the user from the start, and that ITS outperforms umap-UCB.
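
The full algorithmic details are in the paper itself, but the abstract already outlines the core loop of ITS: Bayesian posteriors are maintained over both the arms' mean reward vectors and the user's utility function, and pairwise comparison queries to the user refine the latter. Below is a minimal, illustrative sketch of such a loop. It assumes Bernoulli reward vectors with Beta posteriors, a linear utility function \(\mathbf{w} \cdot \mathbf{r}\) with non-negative weights summing to one, a Bradley-Terry-style comparison likelihood, a crude importance-resampling approximation of the weight posterior, and a query rule that asks the user only when two independent utility samples disagree about the best arm; all of these choices, and every name in the code, are assumptions made for illustration rather than the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_obj, horizon = 5, 2, 500
true_means = rng.uniform(size=(n_arms, n_obj))   # hypothetical per-objective Bernoulli means
true_w = np.array([0.7, 0.3])                    # hypothetical ground-truth utility weights

# Beta(1, 1) posteriors per arm and objective (the environment model).
alpha = np.ones((n_arms, n_obj))
beta = np.ones((n_arms, n_obj))

# Comparison data for the utility model: pairs (x, y) where the user preferred x over y.
comparisons = []

def sample_utility_weights():
    """Crude posterior sample of the weight vector: draw candidates from a Dirichlet
    prior and resample them by a Bradley-Terry-style likelihood of the comparisons."""
    cand = rng.dirichlet(np.ones(n_obj), size=256)
    if not comparisons:
        return cand[0]
    x = np.array([c[0] for c in comparisons])
    y = np.array([c[1] for c in comparisons])
    diff = cand @ (x - y).T                       # shape: (candidates, comparisons)
    log_lik = -np.log1p(np.exp(-diff)).sum(axis=1)
    probs = np.exp(log_lik - log_lik.max())
    probs /= probs.sum()
    return cand[rng.choice(len(cand), p=probs)]

for t in range(horizon):
    theta = rng.beta(alpha, beta)                 # Thompson sample of each arm's mean vector
    w1, w2 = sample_utility_weights(), sample_utility_weights()
    a1, a2 = int(np.argmax(theta @ w1)), int(np.argmax(theta @ w2))
    if a1 != a2:                                  # the two utility samples disagree: query the user
        x, y = theta[a1], theta[a2]
        if true_w @ x >= true_w @ y:              # simulated user answer
            comparisons.append((x, y))
        else:
            comparisons.append((y, x))
    reward = (rng.random(n_obj) < true_means[a1]).astype(float)  # Bernoulli reward vector
    alpha[a1] += reward
    beta[a1] += 1.0 - reward
```

When the sampled weights are replaced by the ground-truth \(\mathbf{w}^*\), the loop reduces to standard Thompson sampling on the scalarised problem, which is the baseline the abstract compares against.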

Notes

  1.

    We note that logistic regression based on maximum likelihood can lead to problems in early iterations of umap-UCB, when there is little data available; we observed this empirically. Specifically, in early iterations umap-UCB with ML logistic regression instead of Bayesian logistic regression can produce an estimate, \(\bar{\mathbf{w}}\), with a near-infinite weight on one objective, after which no further comparison will be asked of the user. This can be prevented with a reasonable choice of prior in Bayesian logistic regression; a small sketch of such a regularised MAP estimate follows these notes.

  2.

    \(\mathbb{E}[\mathbf{x} \cdot \mathbf{y}] = \mathbb{E}[\mathbf{x}] \cdot \mathbb{E}[\mathbf{y}]\) if \(\mathbf{x}\) and \(\mathbf{y}\) are independent.

  3.

    Note that obtaining zero regret does not require the MAP estimate \(\bar{\mathbf{w}}\) to be identical to the ground truth \(\mathbf{w}^*\); it suffices that both lead to selecting the same arm.
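
Note 1 points to Bayesian logistic regression with a reasonable prior as the remedy for the unbounded weight estimates that maximum-likelihood logistic regression can produce early on. The sketch below illustrates that point: a MAP estimate under an assumed Bradley-Terry-style comparison likelihood and an isotropic Gaussian prior. The function name, the toy data, and all parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def map_weights(diffs, prior_var=1.0, iters=500, lr=0.1):
    """MAP estimate of utility weights w (illustrative, not the paper's implementation).
    Likelihood: P(x preferred over y) = sigmoid(w . (x - y)); rows of `diffs` are x - y
    for the preferred x. Prior: isotropic Gaussian N(0, prior_var * I), which acts as
    L2 regularisation and keeps the estimate finite even when the comparisons are
    linearly separable (the failure mode of plain maximum likelihood noted above)."""
    w = np.zeros(diffs.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-diffs @ w))        # predicted preference probabilities
        grad = diffs.T @ (1.0 - p) - w / prior_var  # gradient of the log posterior
        w += lr * grad                              # simple gradient ascent
    return w

# Toy comparisons in which the first objective always wins: maximum likelihood would
# drive its weight towards infinity; the Gaussian prior keeps the MAP estimate bounded.
diffs = np.array([[0.4, -0.1], [0.3, -0.2], [0.5, 0.0]])
print(map_weights(diffs))
```

The prior contributes the \(-\mathbf{w} / \sigma^2\) term in the gradient, which is what keeps the estimate bounded when the comparison data are linearly separable, i.e., exactly the situation in which plain maximum likelihood would push one weight towards infinity.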


Acknowledgements

The first author is a postdoctoral fellow of the Research Foundation – Flanders (FWO). This research was in part supported by Innoviris – Brussels Institute for Research and Innovation.

Author information

Corresponding author

Correspondence to Diederik M. Roijers.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Roijers, D.M., Zintgraf, L.M., Nowé, A. (2017). Interactive Thompson Sampling for Multi-objective Multi-armed Bandits. In: Rothe, J. (ed.) Algorithmic Decision Theory. ADT 2017. Lecture Notes in Computer Science, vol. 10576. Springer, Cham. https://doi.org/10.1007/978-3-319-67504-6_2

  • DOI: https://doi.org/10.1007/978-3-319-67504-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67503-9

  • Online ISBN: 978-3-319-67504-6

  • eBook Packages: Computer Science, Computer Science (R0)
