A stochastic policy search model for matching behavior

Cheng, ZhenBo; Zhang, Yu; Deng, ZhiDong

doi:10.1007/s11432-011-4304-x

Research Papers
Published: 30 June 2011

Science China Information Sciences, Volume 54, pages 1430–1443 (2011)

ZhenBo Cheng¹,², Yu Zhang¹ & ZhiDong Deng¹

Abstract

The matching law is one of the basic empirical laws of decision theory. It states that a subject’s preference among alternative options depends on which choices are reinforced. In this paper, we study possible mechanisms that explain why subjects’ decisions often obey this law. Building on reinforcement learning theory, we propose a decision-making model in which the policy is updated through a policy parameter; the model might be implemented in the brain by the neural circuit formed by the prefrontal cortex and the basal ganglia. From this model, we derive an algorithm that satisfies the matching law under some simple assumptions. Theoretical analysis and simulation results show that the decision behavior produced by the algorithm obeys the matching law. In addition, the algorithm reproduces the matching behavior observed in two classical experiments. Our results offer a plausible strategy underlying the matching law and a useful computational tool for rewarded decision-making tasks.
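
To make the idea of a policy driven by a single policy parameter concrete, the following is a minimal sketch and not the algorithm derived in the paper, whose details are not given in this abstract. It assumes a two-option task in which each option is baited with a fixed probability and stays baited until chosen, a logistic policy with one parameter `theta`, and a generic REINFORCE-style update with a running-average reward baseline. The function name `simulate_matching` and all parameter names and default values are hypothetical.

```python
import math
import random


def simulate_matching(bait_probs=(0.3, 0.1), trials=20000, lr=0.05, seed=0):
    """Illustrative two-option task with a one-parameter stochastic policy.

    Returns (choice_fraction, income_fraction) for option 0.
    All names and defaults are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    theta = 0.0               # policy parameter: log-odds of choosing option 0
    avg_reward = 0.0          # running-average reward, used as a baseline
    baited = [False, False]   # whether a reward is currently waiting on each option
    choices = [0, 0]
    rewards = [0, 0]

    for _ in range(trials):
        # Assumed schedule: an unbaited option becomes baited with its own
        # probability and keeps the reward until it is chosen.
        for i in (0, 1):
            if not baited[i] and rng.random() < bait_probs[i]:
                baited[i] = True

        # Stochastic policy: probability of choosing option 0.
        p0 = 1.0 / (1.0 + math.exp(-theta))
        action = 0 if rng.random() < p0 else 1

        reward = 1.0 if baited[action] else 0.0
        baited[action] = False

        # Derivative of log pi(action) with respect to theta for a logistic policy.
        grad = (1.0 - p0) if action == 0 else -p0

        # REINFORCE-style step with a baseline; theta is clamped for numerical safety.
        theta += lr * (reward - avg_reward) * grad
        theta = max(-10.0, min(10.0, theta))
        avg_reward += 0.01 * (reward - avg_reward)

        choices[action] += 1
        rewards[action] += int(reward)

    choice_fraction = choices[0] / trials
    income_fraction = rewards[0] / max(1, sum(rewards))
    return choice_fraction, income_fraction
```

Comparing the two returned fractions gives a rough check of whether the share of choices allocated to an option tracks the share of rewards earned from it. The sketch makes no claim about how closely they agree for any particular setting.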

Author information

Authors and Affiliations

State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science, Tsinghua University, Beijing, 100084, China
ZhenBo Cheng, Yu Zhang & ZhiDong Deng
Department of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310014, China
ZhenBo Cheng

Corresponding author

Correspondence to ZhiDong Deng.

About this article

Cite this article

Cheng, Z., Zhang, Y. & Deng, Z. A stochastic policy search model for matching behavior. Sci. China Inf. Sci. 54, 1430–1443 (2011). https://doi.org/10.1007/s11432-011-4304-x

Received: 20 September 2010
Accepted: 27 January 2011
Published: 30 June 2011
Issue Date: July 2011
DOI: https://doi.org/10.1007/s11432-011-4304-x
