
Incentivizing Exploration with Heterogeneous Value of Money

  • Conference paper
Web and Internet Economics (WINE 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9470)


Abstract

Recently, Frazier et al. proposed a natural model for crowdsourced exploration of different a priori unknown options: a principal is interested in the long-term welfare of a population of agents who arrive one by one in a multi-armed bandit setting. Each agent, however, is myopic, so in order to incentivize him to explore options with better long-term prospects, the principal must offer him money. Frazier et al. showed that a simple class of policies, called time-expanded policies, is optimal in the worst case, and characterized their budget-reward tradeoff. This previous work assumed that all agents are equally and uniformly susceptible to financial incentives. In reality, agents may value money differently. We therefore extend the model of Frazier et al. to allow agents with heterogeneous and non-linear utilities for money. The principal is informed of each agent’s tradeoff via a signal that may be more or less informative.

Our main result shows that a convex program can be used to derive a signal-dependent time-expanded policy which achieves the best possible Lagrangian reward in the worst case. The worst-case guarantee is matched by so-called “Diamonds in the Rough” instances; the proof that the guarantees match rests on showing that two different convex programs have the same optimal solution on these specific instances.
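To make the incentive problem concrete, here is a minimal illustrative sketch (not the paper's algorithm, and simpler than its model): with Bernoulli arms and Beta posteriors, a myopic agent pulls the arm with the highest posterior mean, so a principal facing an agent with linear utility for money must pay at least the gap in posterior means to induce an exploratory pull of any other arm.

```python
# Illustrative sketch of the basic incentive problem (not the paper's
# algorithm): two Bernoulli arms with Beta posteriors. A myopic agent
# pulls the arm with the highest posterior mean, so inducing a pull of
# the other arm costs the principal at least the gap in posterior means
# when the agent's utility for money is linear.

def posterior_mean(successes, failures):
    """Mean of a Beta(1 + successes, 1 + failures) posterior."""
    return (1 + successes) / (2 + successes + failures)

# Arm 0: well-explored with a decent record; Arm 1: never pulled.
arm0 = posterior_mean(successes=6, failures=4)   # 7/12
arm1 = posterior_mean(successes=0, failures=0)   # 1/2

# Minimum payment making the myopic agent willing to pull arm 1.
required_payment = max(0.0, arm0 - arm1)
print(f"posterior means: {arm0:.3f} vs {arm1:.3f}")
print(f"minimum payment to incentivize exploring arm 1: {required_payment:.3f}")
```

Under the heterogeneous, non-linear utilities studied in this paper, the required payment would instead have to offset this gap through the agent's utility-for-money function, which the principal only learns about through a signal.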


Notes

  1.

    To avoid ambiguity, we consistently refer to the principal as female and the agents as male.

  2.

    Both Frazier et al. [4] and our work in fact consider a generalization in which each arm constitutes an independent Markov chain with Martingale rewards.

  3.

    We use the terms “round” and “time” interchangeably.

  4.

    When the signal space is uncountable, defining the posterior probability density requires the use of Radon-Nikodym derivatives, and raises computational and representational issues. In Sect. 6, we consider what is perhaps the most interesting special case: that the signal reveals the precise value of r to the principal.

  5.

    In Eq. (1), if the support of r is finite, f(r) can be replaced by the probability mass function.

  6.

    A natural justification for having the same discount factor is that after each round, with probability \(1-\gamma \), the game ends.

  7.

Note that \(R^{(\gamma )}(\mathcal {A})\), \(C^{(\gamma )}(\mathcal {A})\) and \({ \mathrm{OPT} }_\gamma \) all depend on the MAB instance.

  8.

As in [4], in order to facilitate the analysis, this may include both myopic and non-myopic pulls of arm i. For instance, if arm 1 was pulled as a non-myopic arm at times 1 and 6, and a myopic pull of arm 1 occurred at time 3, then we would use the state of arm 1 after the pulls at times 1 and 3.

  9.

This is in contrast to the case where the performance of a policy is evaluated on a class of instances rather than a single instance.

  10.

    Note that a priori, it is not clear that this threshold will not change in subsequent rounds; hence, we cannot yet state that a threshold policy is optimal.
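As a quick sanity check of the discount-factor interpretation in Note 6, the following sketch (an illustration, not from the paper) compares the analytic discounted value \(\sum_{t \ge 0} \gamma^t r\) of a constant per-round reward r against a Monte Carlo estimate of the expected undiscounted total reward of a game that ends after each round with probability \(1-\gamma\). The two agree because survival to round t has probability \(\gamma^t\).

```python
import random

random.seed(0)
gamma, r = 0.9, 1.0

# Analytic discounted value of receiving reward r every round:
# sum_{t >= 0} gamma^t * r = r / (1 - gamma)
analytic = r / (1 - gamma)

# Monte Carlo: play until the game ends (probability 1 - gamma per round),
# summing UNdiscounted rewards. Survival to round t has probability
# gamma^t, so the expected total equals the discounted sum above.
trials = 200_000
total = 0.0
for _ in range(trials):
    while True:
        total += r
        if random.random() > gamma:   # game ends with probability 1 - gamma
            break
estimate = total / trials

print(f"analytic {analytic:.3f} vs Monte Carlo {estimate:.3f}")
```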

References

  1. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: Gambling in a rigged casino: the adversarial multi-armed bandit problem. In: Proceedings of the 36th IEEE Symposium on Foundations of Computer Science, pp. 322–331 (1995)

  2. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2003)

  3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  4. Frazier, P., Kempe, D., Kleinberg, J., Kleinberg, R.: Incentivizing exploration. In: Proceedings of the 16th ACM Conference on Economics and Computation, pp. 5–22 (2014)

  5. Gittins, J.C.: Multi-Armed Bandit Allocation Indices. Wiley, New York (1989)

  6. Gittins, J.C., Glazebrook, K.D., Weber, R.: Multi-Armed Bandit Allocation Indices, 2nd edn. Wiley, New York (2011)

  7. Gittins, J.C., Jones, D.M.: A dynamic allocation index for the sequential design of experiments. In: Gani, J. (ed.) Progress in Statistics, pp. 241–266 (1974)

  8. Ho, C.J., Slivkins, A., Vaughan, J.W.: Adaptive contract design for crowdsourcing markets: bandit algorithms for repeated principal-agent problems. In: Proceedings of the 16th ACM Conference on Economics and Computation, pp. 359–376 (2014)

  9. Katehakis, M.N., Veinott Jr., A.F.: The multi-armed bandit problem: decomposition and computation. Math. Oper. Res. 12(2), 262–268 (1987)

  10. Kremer, I., Mansour, Y., Perry, M.: Implementing the “wisdom of the crowd”. In: Proceedings of the 15th ACM Conference on Electronic Commerce, pp. 605–606 (2013)

  11. Lai, T.L., Robbins, H.E.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985)

  12. Mansour, Y., Slivkins, A., Syrgkanis, V.: Bayesian incentive-compatible bandit exploration. In: Proceedings of the 17th ACM Conference on Economics and Computation, pp. 565–582 (2015)

  13. Robbins, H.E.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–535 (1952)

  14. Singla, A., Krause, A.: Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In: 22nd International World Wide Web Conference, pp. 1167–1178 (2013)

  15. Slivkins, A., Wortman Vaughan, J.: Online decision making in crowdsourcing markets: theoretical challenges (position paper). ACM SIGecom Exch. 12(2), 4–23 (2013)

  16. Spence, M.: Job market signaling. Q. J. Econ. 87, 355–374 (1973)

  17. Whittle, P.: Multi-armed bandits and the Gittins index. J. Roy. Stat. Soc. Ser. B (Methodol.) 42(2), 143–149 (1980)


Author information

Correspondence to Ruixin Qiang.


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Han, L., Kempe, D., Qiang, R. (2015). Incentivizing Exploration with Heterogeneous Value of Money. In: Markakis, E., Schäfer, G. (eds) Web and Internet Economics. WINE 2015. Lecture Notes in Computer Science, vol 9470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48995-6_27

  • DOI: https://doi.org/10.1007/978-3-662-48995-6_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48994-9

  • Online ISBN: 978-3-662-48995-6

  • eBook Packages: Computer Science (R0)
