Reward-Based Learning, Model-Based and Model-Free

Encyclopedia of Computational Neuroscience

Definition

Reinforcement learning (RL) techniques are a set of solutions for choosing actions that are optimal in the long term, taking into account both their immediate and their delayed consequences. They fall into two broad classes. Model-based approaches assume an explicit model of the environment and the agent. The model describes the consequences of actions and the associated returns, and optimal policies can be inferred from it. Psychologically, model-based descriptions apply to goal-directed decisions, in which choices reflect current preferences over outcomes. Model-free approaches forgo any explicit knowledge of the dynamics of the environment or the consequences of actions and instead evaluate how good actions are through trial-and-error learning. Model-free values underlie habitual and Pavlovian conditioned responses that are emitted reflexively when faced with certain stimuli. While model-based techniques have substantial computational demands, model-free techniques require extensive experience.
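The contrast between the two classes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the entry: the two-step toy task in `MODEL`, the discount factor `GAMMA`, the learning rate `alpha`, and the function names. The model-based agent reads the transition and reward table directly and plans by value iteration; the model-free agent only calls `step` and learns from sampled experience with tabular Q-learning, one standard model-free method.

```python
import random

GAMMA = 0.9      # discount factor (assumed for this toy example)
TERMINAL = 2     # entering this state ends the episode

# Hypothetical deterministic task: MODEL[state][action] = (next_state, reward).
# In state 0, action 0 takes a small immediate reward; action 1 waits and
# makes a larger delayed reward available in state 1.
MODEL = {
    0: {0: (TERMINAL, 1.0), 1: (1, 0.0)},
    1: {0: (TERMINAL, 0.0), 1: (TERMINAL, 5.0)},
}


def model_based_q(model, gamma=GAMMA, sweeps=50):
    """Plan with the explicit model: value iteration over the full table."""
    q = {s: {a: 0.0 for a in acts} for s, acts in model.items()}
    for _ in range(sweeps):
        for s, acts in model.items():
            for a, (s_next, r) in acts.items():
                future = max(q[s_next].values()) if s_next in q else 0.0
                q[s][a] = r + gamma * future
    return q


def model_free_q(step, actions, episodes=2000, alpha=0.1, gamma=GAMMA, seed=0):
    """Learn by trial and error: Q-learning from sampled transitions only."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in acts} for s, acts in actions.items()}
    for _ in range(episodes):
        s = 0
        while s in q:  # loop until the terminal state is reached
            a = rng.choice(list(q[s]))      # purely exploratory behaviour
            s_next, r = step(s, a)          # experience, not model lookup
            future = max(q[s_next].values()) if s_next in q else 0.0
            q[s][a] += alpha * (r + gamma * future - q[s][a])
            s = s_next
    return q


def step(state, action):
    """Environment interface seen by the model-free learner."""
    return MODEL[state][action]


def greedy(q):
    """Greedy policy: the highest-valued action in each state."""
    return {s: max(acts, key=acts.get) for s, acts in q.items()}


q_mb = model_based_q(MODEL)
q_mf = model_free_q(step, {s: list(acts) for s, acts in MODEL.items()})
```

In this sketch both agents end up preferring to wait in state 0, but they get there differently: the planner needs only a couple of sweeps over the known table, while the learner needs many sampled episodes before its values settle, which mirrors the computation-versus-experience trade-off described above.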

Author information

Corresponding author

Correspondence to Quentin J. M. Huys.

Copyright information

© 2014 Springer Science+Business Media New York

About this entry

Cite this entry

Huys, Q.J.M., Cruickshank, A., Seriès, P. (2014). Reward-Based Learning, Model-Based and Model-Free. In: Jaeger, D., Jung, R. (eds) Encyclopedia of Computational Neuroscience. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7320-6_674-1

  • DOI: https://doi.org/10.1007/978-1-4614-7320-6_674-1

  • Publisher Name: Springer, New York, NY

  • Online ISBN: 978-1-4614-7320-6

  • eBook Packages: Springer Reference Biomedicine and Life Sciences; Reference Module Biomedical and Life Sciences

Chapter history

  1. Latest

    Reward-Based Learning, Model-Based and Model-Free
    Published:
    15 August 2019

    DOI: https://doi.org/10.1007/978-1-4614-7320-6_674-2

  2. Original

    Reward-Based Learning, Model-Based and Model-Free
    Published:
    21 February 2014

    DOI: https://doi.org/10.1007/978-1-4614-7320-6_674-1