Abstract
Understanding extended natural behavior requires a theoretical account of the entire system as it engages in perception and action under multiple concurrent goals, such as foraging for different foods while avoiding several predators and looking for a mate. Reinforcement learning (RL) is a promising framework for this, as it addresses in a very general way the problem of choosing actions so as to maximize a measure of cumulative benefit through some form of learning, and many connections between RL and animal learning have been established. Within this framework, we consider a single agent composed of multiple separate elemental task learners, which we call modules, that jointly learn to solve tasks arising as different combinations of concurrent individual tasks across episodes. Sometimes the goal may be to collect different types of food; at other times several predators must be avoided. The individual modules have separate state representations, i.e. they receive different inputs, but they must act jointly in the agent's common action space. Only a single measure of success is observed: the sum of the reward contributions of all component tasks. We provide a computational solution both for learning elemental task solutions as they contribute to composite goals and for learning how to schedule these modules across different composite tasks and episodes. The algorithm learns to choose the appropriate modules for a particular task and solves the problem of estimating each module's contribution to the total reward. The latter calculation combines each module's current reward estimate with an error signal given by the difference between the global reward and the sum of the reward estimates of the other co-active modules. Because the modules interact through their action-value estimates, action selection is based on their composite contribution to each task combination. The algorithm learns good action-value functions for component tasks and task combinations, which we demonstrate on small classical problems and on a more complex visuomotor navigation task.
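To make the credit-assignment idea described above concrete, the following Python fragment is a minimal sketch of one way such an update could look for tabular modules; the class and function names (Module, select_action, update) and the specific learning-rule details are illustrative assumptions, not the authors' implementation. Each module keeps action values over its own state representation plus a running estimate of its share of the global reward; actions are chosen from the summed module values, and each module's reward estimate is pulled toward the global reward minus the estimated contributions of the other co-active modules.

# Minimal illustrative sketch (assumed names and details, not the authors' code):
# modular SARSA-style learners that share an action space, observe only the
# global reward, and apportion it via per-module reward estimates.

import numpy as np

class Module:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))     # action values over the module's own state
        self.rhat = np.zeros((n_states, n_actions))  # estimated share of the global reward
        self.alpha, self.gamma = alpha, gamma

def select_action(modules, states, epsilon=0.1):
    # Act greedily with respect to the summed Q-values of the active modules
    # (their composite contribution), with epsilon-greedy exploration.
    n_actions = modules[0].Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    total = sum(m.Q[s] for m, s in zip(modules, states))
    return int(np.argmax(total))

def update(modules, states, action, global_reward, next_states, next_action):
    # Credit assignment: move each module's reward estimate toward the global
    # reward minus the reward estimates of the other co-active modules.
    rhat_sum = sum(m.rhat[s, action] for m, s in zip(modules, states))
    for m, s, s2 in zip(modules, states, next_states):
        others = rhat_sum - m.rhat[s, action]
        m.rhat[s, action] += m.alpha * (global_reward - others - m.rhat[s, action])
        # SARSA-style update of the module's action values using its credited reward.
        td_target = m.rhat[s, action] + m.gamma * m.Q[s2, next_action]
        m.Q[s, action] += m.alpha * (td_target - m.Q[s, action])

In this sketch the modules never see each other's states; they are coupled only through the shared action and the decomposition of the single observed reward, mirroring the setting described in the abstract.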
Acknowledgements
The research reported herein was supported by NIH Grants RR009283 and MH060624. C. R. was additionally supported by the EC MEXT-project PLICON and by the EU-Project IM-CLeVeR, FP7-ICT-IP-231722.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Rothkopf, C.A., Ballard, D.H. (2013). Learning and Coordinating Repertoires of Behaviors with Common Reward: Credit Assignment and Module Activation. In: Baldassarre, G., Mirolli, M. (eds) Computational and Robotic Models of the Hierarchical Organization of Behavior. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39875-9_6
DOI: https://doi.org/10.1007/978-3-642-39875-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39874-2
Online ISBN: 978-3-642-39875-9