2008 Special Issue
Finding intrinsic rewards by embodied evolution and constrained reinforcement learning
Introduction
In applying reinforcement learning algorithms to real-world problems, the design of the reward function is critical for the successful achievement of a task. Although it appears straightforward to assign positive rewards to desired goal states and negative rewards to states to be avoided, finding a good balance between multiple rewards often needs careful tuning (Kamioka, Uchibe, & Doya, 2007). Furthermore, if rewards are given only at isolated goal states, blind exploration of the state space takes an extremely long time except in toy problems. Rewards at intermediate sub-goals, or even along the trajectories leading to the goal, promote focused exploration, but appropriate design of such additional rewards usually requires prior knowledge of the task or trial and error by the experimenter.
In this paper, we consider a reinforcement learning framework with two types of reward functions: the extrinsic rewards that are directly linked with the achievement of a task or the fitness of an agent and the intrinsic rewards that implicitly help success of the task or fitness of the agent. We propose a method for autonomous agents to find appropriate intrinsic rewards by combining constrained reinforcement learning (Uchibe & Doya, 2007a) and embodied evolution (Elfwing, 2007, Elfwing et al., in press).
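The division of labor between the two reward types can be sketched as a constrained policy-gradient step: the policy parameters ascend the intrinsic-reward objective, while the extrinsic reward acts as a constraint that must stay above a threshold. The penalty form, names, and parameters below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def constrained_update(theta, grad_intrinsic, grad_extrinsic, avg_extrinsic,
                       threshold=0.0, lr=0.01, penalty=10.0):
    """One sketched constrained policy-gradient step (illustrative only).

    Ascend the intrinsic-reward objective; when the running average of the
    extrinsic reward falls below `threshold`, add a penalty gradient that
    pushes the policy back toward satisfying the task constraint.
    """
    step = grad_intrinsic.copy()
    if avg_extrinsic < threshold:          # task constraint violated
        step += penalty * grad_extrinsic   # restore extrinsic performance first
    return theta + lr * step

# Toy usage: when the constraint is satisfied, only the intrinsic
# gradient moves the parameters; when violated, both do.
theta = np.zeros(2)
g_int, g_ext = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta_ok = constrained_update(theta, g_int, g_ext, avg_extrinsic=1.0)
theta_bad = constrained_update(theta, g_int, g_ext, avg_extrinsic=-1.0)
```

This penalty-based sketch is only one way to realize a constraint; the actual method (Uchibe & Doya, 2007a) treats the extrinsic rewards as constraints within the policy improvement itself.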
A popular way to promote exploratory behavior is the use of exploration bonuses (Främling, 2007) or, equivalently, optimistic initialization of the value function. While exploration bonuses merely promote uniform scanning of the state space, a recent line of work called 'Intrinsically Motivated Reinforcement Learning' (IMRL) aims to design intrinsic rewards that guide robots to 'interesting' parts of the state space. Proposed criteria for such intrinsic rewards include the prediction errors of the robot's internal model (Barto et al., 2004, Meeden et al., 2004, Singh et al., 2005, Stout et al., 2005) and the reduction in those prediction errors (Oudeyer and Kaplan, 2004, Oudeyer et al., 2007). In this study, instead of assuming particular forms of intrinsic rewards, we let distributed autonomous robots, Cyber Rodents (Doya & Uchibe, 2005), find appropriate intrinsic reward functions through evolution. While a fixed set of extrinsic rewards specifies the constraints of survival (capturing battery packs) and reproduction (exchanging 'genes' through infrared communication), intrinsic reward functions that facilitate goal-directed exploration are found by evolution in a colony of the robots.
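As a baseline for comparison, an exploration bonus can be folded directly into a tabular Q-learning update. The count-based bonus below is a common illustrative choice and an assumption on our part, not the scheme used in the cited works:

```python
import numpy as np

def q_update_with_bonus(Q, counts, s, a, r, s_next,
                        alpha=0.1, gamma=0.95, beta=0.5):
    """One Q-learning step with a count-based exploration bonus (a sketch).

    The bonus beta / sqrt(visit count) decays as the state-action pair is
    revisited, so the agent uniformly scans under-visited parts of the
    state space rather than being guided to 'interesting' ones.
    """
    counts[s, a] += 1
    bonus = beta / np.sqrt(counts[s, a])           # shrinks with each visit
    target = r + bonus + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage on a 2-state, 2-action problem.
Q = np.zeros((2, 2))
counts = np.zeros((2, 2))
Q = q_update_with_bonus(Q, counts, s=0, a=1, r=1.0, s_next=1)
```

The contrast with IMRL is that here the bonus depends only on visitation counts, not on any model of what makes a state 'interesting'.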
We first outline the general method of reinforcement learning with intrinsic reward functions under task constraints imposed as extrinsic reward functions (Uchibe & Doya, 2007a), and of finding appropriate intrinsic reward functions by embodied evolution. We then introduce the Cyber Rodent platform used in our experiments and describe the implementation and results of the experiments. We presented preliminary results at ICONIP2007 (Uchibe & Doya, 2007b); here we present more systematic experiments on the robustness and effectiveness of the proposed approach.
Section snippets
Embodied evolution of intrinsic reward function for constrained reinforcement learning
Designing appropriate reward functions is a non-trivial, time-consuming process in practical applications of reinforcement learning. Reward functions can usually be classified into two types: those directly representing the successful achievement of the task and those aimed at facilitating efficient and robust learning. In this paper, we assume that the former, 'extrinsic rewards', are fixed for a given task and consider how the latter, 'intrinsic rewards', can be optimized by the agents
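The evolutionary part can be sketched with a real-coded crossover operator such as BLX-α, which follows the interval-schemata idea cited in the references: each child gene is drawn uniformly from the parents' interval, widened by a factor α on both sides. Applying it to intrinsic-reward weight vectors exchanged between robots is our illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def blx_alpha(parent1, parent2, alpha=0.5):
    """Blend (BLX-alpha) crossover for real-coded genomes (a sketch).

    Each child gene is sampled uniformly from [lo - alpha*span, hi + alpha*span],
    where lo/hi bound the two parents' genes, so offspring can also explore
    slightly outside the parents' interval.
    """
    lo = np.minimum(parent1, parent2)
    hi = np.maximum(parent1, parent2)
    span = hi - lo
    return rng.uniform(lo - alpha * span, hi + alpha * span)

# Two robots exchange genomes (hypothetical intrinsic-reward weights)
# over infrared and blend them into a child genome.
child = blx_alpha(np.array([0.2, -0.5]), np.array([0.6, 0.1]))
```

In an embodied-evolution setting, such a crossover would run on board each robot after a successful mating, with no centralized population manager.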
Cyber rodent hardware
Before getting into the details of the experiments, we briefly describe our hardware system. Fig. 4(a) shows the hardware of the Cyber Rodent (CR) (Doya & Uchibe, 2005). Its body is 22 cm long and weighs 1.75 kg. The CR is endowed with a variety of sensory inputs, including an omni-directional CMOS camera, an IR range sensor, seven IR proximity sensors, gyros, and an accelerometer. Its motion system consists of two wheels that allow the CR to move at a velocity of 1.3 m/s, but the maximum velocity
Reward functions
In order to investigate embodied evolution for finding intrinsic rewards, we tested survival and mating tasks with the Cyber Rodents. Fig. 5(a) shows a snapshot of actual embodied evolution in this study. There are three CRs (named CR1, CR2, and CR3), many battery packs, and four obstacles in the environment. The experimental field, surrounded by a wall, measures 6 m × 5 m, as shown in Fig. 5(b). The objective for a group of CRs is to find appropriate intrinsic
Obtained intrinsic rewards and fitness values
First, we show the results under setting T1 in detail. Due to the slow learning of the policy gradient algorithm, it took a long time to obtain avoidance behaviors compared with our previous studies (Doya and Uchibe, 2005, Uchibe and Doya, 2004). Fig. 6(a) shows the number of battery packs obtained per 10 min. All CRs acquired foraging behaviors after about 130 min. This was sufficient for the CRs to continue to survive in the tested environment. Fig. 6(b) shows the number of successful
Conclusion
This paper proposed a method for finding appropriate intrinsic rewards through evolution in a group of real mobile robots, the Cyber Rodents. Note that maximizing the average of an intrinsic reward alone is meaningless from the experimenters' viewpoint, because a CR without constraints just wandered in the environment. By introducing constraints into policy improvement, the intrinsic reward becomes meaningful. In order to evaluate the efficiency of the evolved intrinsic rewards, several
References (22)
- Eshelman, L.J., & Schaffer, J.D. Real-coded genetic algorithms and interval-schemata.
- Främling, K. Guiding exploration by pre-existing knowledge without modifying reward. Neural Networks (2007).
- Watson, R.A., Ficici, S.G., & Pollack, J.B. Embodied evolution: Distributing an evolutionary algorithm in a population of robots. Robotics and Autonomous Systems (2002).
- Barto, A.G., Singh, S., & Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collections of skills....
- et al. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research (2001).
- Doya, K., & Uchibe, E. The Cyber Rodent Project: Exploration of adaptive mechanisms for self-preservation and self-reproduction. Adaptive Behavior (2005).
- Elfwing, S. (2007). Embodied evolution of learning ability. Ph.D. thesis. Stockholm, Sweden: KTH School of Computer...
- Elfwing, S., Uchibe, E., Doya, K., & Christensen, H.I. Darwinian embodied evolution of the learning ability for...
- Kamioka, T., Uchibe, E., & Doya, K. Max–min actor-critic for multiple reward reinforcement learning. IEICE Transactions on Information and Systems J90-D (2007).
- Konda, V.R., & Tsitsiklis, J.N. Actor-critic algorithms. SIAM Journal on Control and Optimization (2003).