Abstract
Recently, Frazier et al. proposed a natural model for crowdsourced exploration of different a priori unknown options: a principal is interested in the long-term welfare of a population of agents who arrive one by one in a multi-armed bandit setting. However, each agent is myopic, so in order to incentivize him to explore options with better long-term prospects, the principal must offer the agent money. Frazier et al. showed that a simple class of policies, called time-expanded policies, is optimal in the worst case, and characterized their budget-reward tradeoff. Their work assumed that all agents are equally and uniformly susceptible to financial incentives. In reality, agents may value money differently. We therefore extend the model of Frazier et al. to allow agents with heterogeneous and non-linear utilities for money. The principal is informed of each agent’s tradeoff via a signal that could be more or less informative.
Our main result is to show that a convex program can be used to derive a signal-dependent time-expanded policy which achieves the best possible Lagrangian reward in the worst case. The worst-case guarantee is matched by so-called “Diamonds in the Rough” instances; the proof that the guarantees match is based on showing that two different convex programs have the same optimal solution for these specific instances.
Notes
- 1.
To avoid ambiguity, we consistently refer to the principal as female and the agents as male.
- 2.
Both Frazier et al. [4] and our work in fact consider a generalization in which each arm constitutes an independent Markov chain with martingale rewards.
- 3.
We use the terms “round” and “time” interchangeably.
- 4.
When the signal space is uncountable, defining the posterior probability density requires the use of Radon-Nikodym derivatives, and raises computational and representational issues. In Sect. 6, we consider what is perhaps the most interesting special case: that the signal reveals the precise value of r to the principal.
- 5.
In Eq. (1), if the support of r is finite, f(r) can be replaced by the probability mass function.
- 6.
A natural justification for having the same discount factor is that after each round, with probability \(1-\gamma \), the game ends.
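The equivalence asserted in this note is easy to check numerically. The following sketch (illustrative only, not from the paper) compares the geometrically discounted value of a fixed reward stream against a Monte Carlo simulation of an undiscounted game that ends after each round independently with probability \(1-\gamma \); the two agree in expectation.

```python
import random

def discounted_value(rewards, gamma):
    """Standard geometric discounting: sum of gamma^t * r_t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def simulated_stopping_value(rewards, gamma, trials=200_000, seed=0):
    """Undiscounted rewards, but after each round the game ends
    with probability 1 - gamma; returns the average total reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        for r in rewards:
            total += r
            if rng.random() > gamma:  # game ends with probability 1 - gamma
                break
    return total / trials
```

For rewards `[1.0, 1.0, 1.0]` and `gamma = 0.9`, the discounted value is \(1 + 0.9 + 0.81 = 2.71\), and the simulated stopping value converges to the same number: the reward at round \(t\) is collected exactly when the game has survived \(t\) continuation checks, which happens with probability \(\gamma^t\).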
- 7.
Note that all \(R^{(\gamma )}(\mathcal {A})\), \(C^{(\gamma )}(\mathcal {A})\) and \({ \mathrm{OPT} }_\gamma \) depend on the MAB instance.
- 8.
As in [4], in order to facilitate the analysis, this may include both myopic and non-myopic pulls of arm i. For instance, if arm 1 was pulled as a non-myopic arm at times 1 and 6, and a myopic pull of arm 1 occurred at time 3, then we would use the state of arm 1 after the pulls at times 1 and 3.
- 9.
This is in contrast to the case where the performance of a policy is evaluated on a class of instances rather than a single instance.
- 10.
Note that a priori, it is not clear that this threshold will not change in subsequent rounds; hence, we cannot yet state that a threshold policy is optimal.
References
Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: Gambling in a rigged casino: the adversarial multi-armed bandit problem. In: Proceedings of the 36th IEEE Symposium on Foundations of Computer Science, pp. 322–331 (1995)
Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2003)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Frazier, P., Kempe, D., Kleinberg, J., Kleinberg, R.: Incentivizing exploration. In: Proceedings of the 16th ACM Conference on Economics and Computation, pp. 5–22 (2014)
Gittins, J.C.: Multi-Armed Bandit Allocation Indices. Wiley, New York (1989)
Gittins, J.C., Glazebrook, K.D., Weber, R.: Multi-Armed Bandit Allocation Indices, 2nd edn. Wiley, New York (2011)
Gittins, J.C., Jones, D.M.: A dynamic allocation index for the sequential design of experiments. In: Gani, J. (ed.) Progress in Statistics, pp. 241–266 (1974)
Ho, C.J., Slivkins, A., Vaughan, J.W.: Adaptive contract design for crowdsourcing markets: bandit algorithms for repeated principal-agent problems. In: Proceedings of the 16th ACM Conference on Economics and Computation, pp. 359–376 (2014)
Katehakis, M.N., Veinott Jr., A.F.: The multi-armed bandit problem: decomposition and computation. Math. Oper. Res. 12(2), 262–268 (1987)
Kremer, I., Mansour, Y., Perry, M.: Implementing the “wisdom of the crowd”. In: Proceedings of the 15th ACM Conference on Electronic Commerce, pp. 605–606 (2013)
Lai, T.L., Robbins, H.E.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985)
Mansour, Y., Slivkins, A., Syrgkanis, V.: Bayesian incentive-compatible bandit exploration. In: Proceedings of the 17th ACM Conference on Economics and Computation, pp. 565–582 (2015)
Robbins, H.E.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–535 (1952)
Singla, A., Krause, A.: Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In: 22nd International World Wide Web Conference, pp. 1167–1178 (2013)
Slivkins, A., Wortman Vaughan, J.: Online decision making in crowdsourcing markets: theoretical challenges (position paper). ACM SIGecom Exch. 12(2), 4–23 (2013)
Spence, M.: Job market signaling. Q. J. Econ. 87, 355–374 (1973)
Whittle, P.: Multi-armed bandits and the Gittins index. J. Roy. Stat. Soc. Ser. B (Methodol.) 42(2), 143–149 (1980)
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, L., Kempe, D., Qiang, R. (2015). Incentivizing Exploration with Heterogeneous Value of Money. In: Markakis, E., Schäfer, G. (eds) Web and Internet Economics. WINE 2015. Lecture Notes in Computer Science(), vol 9470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48995-6_27
Print ISBN: 978-3-662-48994-9
Online ISBN: 978-3-662-48995-6