2008 Special Issue
Finding intrinsic rewards by embodied evolution and constrained reinforcement learning
Introduction
In applying reinforcement learning algorithms to real-world problems, the design of the reward function is critical for the successful achievement of a task. Although it appears straightforward to assign positive rewards to desired goal states and negative rewards to states to be avoided, finding a good balance between multiple rewards often needs careful tuning (Kamioka, Uchibe, & Doya, 2007). Furthermore, if rewards are given only at isolated goal states, blind exploration of the state space takes an extremely long time except in toy problems. Rewards at intermediate sub-goals, or even along the trajectories leading to the goal, promote focused exploration, but appropriate design of such additional rewards usually requires prior knowledge of the task or trial and error by the experimenter.
In this paper, we consider a reinforcement learning framework with two types of reward functions: the extrinsic rewards that are directly linked with the achievement of a task or the fitness of an agent and the intrinsic rewards that implicitly help success of the task or fitness of the agent. We propose a method for autonomous agents to find appropriate intrinsic rewards by combining constrained reinforcement learning (Uchibe & Doya, 2007a) and embodied evolution (Elfwing, 2007, Elfwing et al., in press).
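The division of labor between the two reward types can be sketched as a constrained policy-gradient step: the policy parameters ascend the intrinsic-reward objective, while the extrinsic reward acts as a constraint that must stay above a threshold. The penalty form, names, and parameters below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def constrained_update(theta, grad_intrinsic, grad_extrinsic, avg_extrinsic,
                       threshold=0.0, lr=0.01, penalty=10.0):
    """One sketched constrained policy-gradient step (illustrative only).

    Ascend the intrinsic-reward objective; when the running average of the
    extrinsic reward falls below `threshold`, add a penalty gradient that
    pushes the policy back toward satisfying the task constraint.
    """
    step = grad_intrinsic.copy()
    if avg_extrinsic < threshold:          # task constraint violated
        step += penalty * grad_extrinsic   # restore extrinsic performance first
    return theta + lr * step

# Toy usage: when the constraint is satisfied, only the intrinsic
# gradient moves the parameters; when violated, both do.
theta = np.zeros(2)
g_int, g_ext = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta_ok = constrained_update(theta, g_int, g_ext, avg_extrinsic=1.0)
theta_bad = constrained_update(theta, g_int, g_ext, avg_extrinsic=-1.0)
```

This penalty-based sketch is only one way to realize a constraint; the actual method (Uchibe & Doya, 2007a) treats the extrinsic rewards as constraints within the policy improvement itself.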
A popular way to promote exploratory behavior is the use of exploration bonuses (Främling, 2007) or, equivalently, optimistic initialization of the value function. While exploration bonuses merely promote uniform scanning of the state space, a recent line of work called 'Intrinsically Motivated Reinforcement Learning' (IMRL) aims to design intrinsic rewards that guide robots to 'interesting' parts of the state space. Proposed criteria for such intrinsic rewards include the prediction errors of the robot's internal model (Barto et al., 2004, Meeden et al., 2004, Singh et al., 2005, Stout et al., 2005) and the reduction in those prediction errors (Oudeyer and Kaplan, 2004, Oudeyer et al., 2007). In this study, instead of assuming particular forms of intrinsic rewards, we let distributed autonomous robots, Cyber Rodents (Doya & Uchibe, 2005), find appropriate intrinsic reward functions through evolution. While a fixed set of extrinsic rewards specifies the constraints of survival (capturing battery packs) and reproduction (exchanging 'genes' through infrared communication), intrinsic reward functions that facilitate goal-directed exploration are found by evolution in a colony of the robots.
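As a baseline for comparison, an exploration bonus can be folded directly into a tabular Q-learning update. The count-based bonus below is a common illustrative choice and an assumption on our part, not the scheme used in the cited works:

```python
import numpy as np

def q_update_with_bonus(Q, counts, s, a, r, s_next,
                        alpha=0.1, gamma=0.95, beta=0.5):
    """One Q-learning step with a count-based exploration bonus (a sketch).

    The bonus beta / sqrt(visit count) decays as the state-action pair is
    revisited, so the agent uniformly scans under-visited parts of the
    state space rather than being guided to 'interesting' ones.
    """
    counts[s, a] += 1
    bonus = beta / np.sqrt(counts[s, a])           # shrinks with each visit
    target = r + bonus + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage on a 2-state, 2-action problem.
Q = np.zeros((2, 2))
counts = np.zeros((2, 2))
Q = q_update_with_bonus(Q, counts, s=0, a=1, r=1.0, s_next=1)
```

The contrast with IMRL is that here the bonus depends only on visitation counts, not on any model of what makes a state 'interesting'.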
We first outline the general method of reinforcement learning with intrinsic reward functions under task constraints imposed as extrinsic reward functions (Uchibe & Doya, 2007a), and of finding appropriate intrinsic reward functions by embodied evolution. We then introduce the Cyber Rodent platform used in our experiments and describe the implementation and results of the experiments. We presented preliminary results at ICONIP2007 (Uchibe & Doya, 2007b); here we present more systematic experiments on the robustness and effectiveness of the proposed approach.
Section snippets
Embodied evolution of intrinsic reward function for constrained reinforcement learning
Designing appropriate reward functions is a non-trivial, time-consuming process in practical applications of reinforcement learning. Reward functions can usually be classified into two types: those directly representing the successful achievement of the task and those aimed at facilitating efficient and robust learning. In this paper, we assume that the former, 'extrinsic rewards', are fixed for a given task and consider how the latter, 'intrinsic rewards', can be optimized by the agents
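The evolutionary part can be sketched with a real-coded crossover operator such as BLX-α, which follows the interval-schemata idea cited in the references: each child gene is drawn uniformly from the parents' interval, widened by a factor α on both sides. Applying it to intrinsic-reward weight vectors exchanged between robots is our illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def blx_alpha(parent1, parent2, alpha=0.5):
    """Blend (BLX-alpha) crossover for real-coded genomes (a sketch).

    Each child gene is sampled uniformly from [lo - alpha*span, hi + alpha*span],
    where lo/hi bound the two parents' genes, so offspring can also explore
    slightly outside the parents' interval.
    """
    lo = np.minimum(parent1, parent2)
    hi = np.maximum(parent1, parent2)
    span = hi - lo
    return rng.uniform(lo - alpha * span, hi + alpha * span)

# Two robots exchange genomes (hypothetical intrinsic-reward weights)
# over infrared and blend them into a child genome.
child = blx_alpha(np.array([0.2, -0.5]), np.array([0.6, 0.1]))
```

In an embodied-evolution setting, such a crossover would run on board each robot after a successful mating, with no centralized population manager.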
Cyber rodent hardware
Before getting into the details of the experiments, we briefly describe our hardware system. Fig. 4(a) shows the hardware of the Cyber Rodent (CR) (Doya & Uchibe, 2005). Its body is 22 cm long and weighs 1.75 kg. The CR is endowed with a variety of sensory inputs, including an omni-directional CMOS camera, an IR range sensor, seven IR proximity sensors, gyros, and an accelerometer. Its motion system consists of two wheels that allow the CR to move at a velocity of 1.3 m/s, but the maximum velocity
Reward functions
In order to investigate embodied evolution for finding intrinsic rewards, we tested survival and mating tasks with the Cyber Rodents. Fig. 5(a) shows a snapshot of actual embodied evolution in this study. There are three CRs (named CR1, CR2, and CR3), many battery packs, and four obstacles in the environment. The experimental field, surrounded by a wall, measures 6 m × 5 m, as shown in Fig. 5(b). The objective for a group of CRs is to find appropriate intrinsic
Obtained intrinsic rewards and fitness values
First, we show the results under setting T1 in detail. Due to the slow learning of the policy gradient algorithm, it took a long time to obtain avoidance behaviors compared with our previous studies (Doya and Uchibe, 2005, Uchibe and Doya, 2004). Fig. 6(a) shows the number of battery packs obtained per 10 min. All CRs acquired foraging behaviors after about 130 min. This was sufficient for the CRs to continue to survive in the tested environment. Fig. 6(b) shows the number of successful
Conclusion
This paper proposed a method for finding appropriate intrinsic rewards through evolution in a group of real mobile robots, the Cyber Rodents. Note that maximizing the average of an intrinsic reward alone is meaningless from the experimenters' viewpoint, because a CR without constraints just wandered in the environment. By introducing constraints into policy improvement, the intrinsic reward becomes meaningful. In order to evaluate the efficiency of the evolved intrinsic rewards, several
References (22)
- Eshelman, L.J., & Schaffer, J.D. Real-coded genetic algorithms and interval-schemata.
- Främling, K. Guiding exploration by pre-existing knowledge without modifying reward. Neural Networks (2007).
- Watson, R.A., Ficici, S.G., & Pollack, J.B. Embodied evolution: Distributing an evolutionary algorithm in a population of robots. Robotics and Autonomous Systems (2002).
- Barto, A.G., Singh, S., & Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collections of skills....
- et al. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research (2001).
- Doya, K., & Uchibe, E. The Cyber Rodent Project: Exploration of adaptive mechanisms for self-preservation and self-reproduction. Adaptive Behavior (2005).
- Elfwing, S. (2007). Embodied evolution of learning ability. Ph.D. thesis. Stockholm, Sweden: KTH School of Computer...
- Elfwing, S., Uchibe, E., Doya, K., & Christensen, H.I. Darwinian embodied evolution of the learning ability for...
- Kamioka, T., Uchibe, E., & Doya, K. Max–min actor-critic for multiple reward reinforcement learning. IEICE Transactions on Information and Systems J90-D (2007).
- Konda, V.R., & Tsitsiklis, J.N. Actor-critic algorithms. SIAM Journal on Control and Optimization (2003).