Neural Networks

Volume 21, Issue 10, December 2008, Pages 1447-1455

2008 Special Issue
Finding intrinsic rewards by embodied evolution and constrained reinforcement learning

https://doi.org/10.1016/j.neunet.2008.09.013

Abstract

Understanding the design principle of reward functions is a substantial challenge in both artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals but also rewards for intermediate states that promote effective exploration. This paper proposes a method for designing ‘intrinsic’ rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and ‘mating’ by software reproduction are the three major ‘extrinsic’ rewards. We show in hardware experiments that the robots can find appropriate ‘intrinsic’ rewards for the vision of battery packs and other robots to promote approach behaviors.

Introduction

In applying reinforcement learning algorithms to real-world problems, the design of the reward function is critical for the successful achievement of a task. Although it appears straightforward to assign positive rewards to desired goal states and negative rewards to states to be avoided, finding a good balance between multiple rewards often requires careful tuning (Kamioka, Uchibe, & Doya, 2007). Furthermore, if rewards are given only at isolated goal states, blind exploration of the state space takes an extremely long time except in toy problems. Rewards at intermediate sub-goals, or even along the trajectories leading to the goal, promote focused exploration, but appropriate design of such additional rewards usually requires prior knowledge of the task or trial and error by the experimenter.

In this paper, we consider a reinforcement learning framework with two types of reward functions: the extrinsic rewards, which are directly linked to the achievement of a task or the fitness of an agent, and the intrinsic rewards, which implicitly help the success of the task or the fitness of the agent. We propose a method for autonomous agents to find appropriate intrinsic rewards by combining constrained reinforcement learning (Uchibe & Doya, 2007a) and embodied evolution (Elfwing, 2007; Elfwing et al., in press).
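
The reference list includes Eshelman et al.'s real-coded genetic algorithms with interval schemata; the blend crossover (BLX-alpha) from that line of work is one standard way real-valued genomes, here the parameters of an intrinsic reward function, could be recombined when two robots mate. The Python sketch below is only an illustration under that assumption; the paper's actual genetic operators and their settings are not specified in this excerpt.

    import numpy as np

    def blx_alpha(parent1, parent2, alpha=0.5, rng=None):
        """Blend crossover (BLX-alpha) for real-coded genomes, e.g. the
        parameter vectors of two robots' intrinsic reward functions."""
        rng = np.random.default_rng() if rng is None else rng
        lo = np.minimum(parent1, parent2)
        hi = np.maximum(parent1, parent2)
        span = hi - lo
        # Sample each offspring gene uniformly from the parents' interval,
        # extended by a fraction alpha on both sides.
        return rng.uniform(lo - alpha * span, hi + alpha * span)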

A popular way to promote exploratory behavior is the use of exploration bonuses (Främling, 2007) or, equivalently, optimistic initialization of value functions. While exploration bonuses merely promote uniform scanning of the state space, recent studies on ‘Intrinsically Motivated Reinforcement Learning’ (IMRL) have aimed at designing intrinsic rewards that guide robots to ‘interesting’ parts of the state space. The criteria for such intrinsic rewards include the prediction errors of the robot’s internal model (Barto et al., 2004; Meeden et al., 2004; Singh et al., 2005; Stout et al., 2005) and the reduction in the prediction errors (Oudeyer & Kaplan, 2004; Oudeyer et al., 2007). In this study, instead of assuming particular forms of intrinsic rewards, we let distributed autonomous robots, the Cyber Rodents (Doya & Uchibe, 2005), find appropriate intrinsic reward functions through evolution. While a fixed set of extrinsic rewards specifies the agents’ constraints of survival (capturing battery packs) and reproduction (exchanging their ‘genes’ by infrared communication), intrinsic reward functions that facilitate goal-directed exploration are found by evolution in a colony of robots.
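
As a concrete illustration of the two IMRL criteria mentioned above, the Python sketch below computes an intrinsic reward either from the prediction error of a simple forward model or from the recent reduction of that error (learning progress). The linear model, its learning rate, and the averaging window are illustrative choices, not the cited authors' implementations.

    import numpy as np

    class ForwardModel:
        """Linear forward model s_next ~ W [s; a], used only to illustrate
        prediction-error-based intrinsic rewards."""

        def __init__(self, state_dim, action_dim, lr=0.05):
            self.W = np.zeros((state_dim, state_dim + action_dim))
            self.lr = lr
            self.errors = []                     # history of prediction errors

        def intrinsic_reward(self, s, a, s_next, window=20):
            x = np.concatenate([s, a])
            pred = self.W @ x
            err = float(np.linalg.norm(s_next - pred))
            self.errors.append(err)
            # Criterion 1: reward surprising (poorly predicted) transitions.
            r_error = err
            # Criterion 2: reward the *reduction* of prediction error, i.e.
            # learning progress over a recent window.
            past = self.errors[-2 * window:-window] or [err]
            r_progress = float(np.mean(past) - np.mean(self.errors[-window:]))
            # Update the forward model toward the observed next state.
            self.W += self.lr * np.outer(s_next - pred, x)
            return r_error, r_progress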

We first outline the general method of reinforcement learning with intrinsic reward functions under task constraints imposed as extrinsic reward functions (Uchibe & Doya, 2007a), and of finding appropriate intrinsic reward functions by embodied evolution. We then introduce the Cyber Rodent platform used for our experiments and describe the implementation and results of the experiments. We presented our preliminary results at ICONIP 2007 (Uchibe & Doya, 2007b); here we present the results of more systematic experiments on the robustness and effectiveness of our proposed approach.

Section snippets

Embodied evolution of intrinsic reward function for constrained reinforcement learning

Designing appropriate reward functions is a non-trivial, time-consuming process in practical applications of reinforcement learning. Reward functions can usually be classified into two types: those directly representing the successful achievement of the task and those aimed at facilitating efficient and robust learning. In this paper, we assume that the former, ‘extrinsic rewards’, are fixed for a given task and consider how the latter, ‘intrinsic rewards’, can be optimized by the agents
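
The constrained formulation can be pictured as follows: the policy gradient of the average intrinsic reward is followed only to the extent that it does not degrade the extrinsic reward constraints, and a violated constraint is recovered first. The Python sketch below is one simple projection-style update in that spirit; it is not the exact algorithm of Uchibe and Doya (2007a), and all names are illustrative.

    import numpy as np

    def constrained_update(theta, g_int, g_ext, ext_values, thresholds, lr=0.01):
        """One policy-parameter update: maximize the intrinsic objective while
        keeping each extrinsic average reward above its threshold.

        theta      : current policy parameters
        g_int      : gradient of the average intrinsic reward
        g_ext      : list of gradients of the extrinsic average rewards
        ext_values : current estimates of the extrinsic average rewards
        thresholds : levels the extrinsic rewards must stay above
        """
        d = np.asarray(g_int, dtype=float).copy()
        for g_c, v, b in zip(g_ext, ext_values, thresholds):
            g_c = np.asarray(g_c, dtype=float)
            if v < b:
                d = g_c                   # violated constraint: recover it first
                break
            if d @ g_c < 0.0:
                # Project out the component that would decrease a satisfied
                # extrinsic reward.
                d = d - (d @ g_c) / (g_c @ g_c + 1e-12) * g_c
        return theta + lr * d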

Cyber rodent hardware

Before going into the details of the experiments, we briefly describe our hardware system. Fig. 4(a) shows the hardware of the Cyber Rodent (CR) (Doya & Uchibe, 2005). Its body is 22 cm in length and 1.75 kg in weight. The CR is endowed with a variety of sensory inputs, including an omni-directional CMOS camera, an IR range sensor, seven IR proximity sensors, gyros, and an accelerometer. Its motion system consists of two wheels that allow the CR to move at a velocity of 1.3 m/s, but the maximum velocity
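
For reference, the specifications quoted above can be collected into a single record, as in the Python sketch below; the field names are our own, and the speed field repeats only the quoted figure (the snippet is cut off before the maximum-velocity value).

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CyberRodentSpec:
        """Cyber Rodent specifications as quoted in the text (illustrative record)."""
        body_length_cm: float = 22.0
        mass_kg: float = 1.75
        wheel_speed_m_s: float = 1.3            # two-wheeled drive
        camera: str = "omni-directional CMOS"
        ir_range_sensors: int = 1
        ir_proximity_sensors: int = 7
        inertial_sensors: str = "gyros and accelerometer"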

Reward functions

To investigate embodied evolution for finding intrinsic rewards, we tested survival and mating tasks on the Cyber Rodents. Fig. 5(a) shows a snapshot of actual embodied evolution in this study. The environment contains three CRs (named CR1, CR2, and CR3; hence Nrobot = 3), many battery packs, and four obstacles. The experimental field, surrounded by a wall, is 6 m × 5 m in size, as shown in Fig. 5(b). The objective for a group of CRs is to find appropriate intrinsic
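
The abstract names three extrinsic reward channels: collision avoidance, recharging from battery packs, and mating by software reproduction. The Python sketch below shows one way such event-based extrinsic rewards could be encoded; the numeric values are placeholders for illustration, not the values used in the experiments.

    # Illustrative extrinsic reward channels (placeholder values, not the paper's).
    EXTRINSIC_REWARDS = {
        "collision": -1.0,          # penalty when proximity sensors detect a hit
        "battery_captured": 1.0,    # reward for docking with a battery pack
        "mating_success": 1.0,      # reward for exchanging 'genes' over IR
    }

    def extrinsic_reward(events):
        """Sum the extrinsic reward for the events observed in one control step."""
        return sum(EXTRINSIC_REWARDS[e] for e in events)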

Obtained intrinsic rewards and fitness values

First, we show the results under setting T1 in detail. Due to the slow learning of the policy gradient algorithm, it took longer to obtain avoidance behaviors than in our previous studies (Doya & Uchibe, 2005; Uchibe & Doya, 2004). Fig. 6(a) shows the number of battery packs obtained per 10 min. All CRs acquired foraging behaviors after about 130 min. This was sufficient for the CRs to continue to survive in the tested environment. Fig. 6(b) shows the number of successful

Conclusion

This paper proposed a method for finding appropriate intrinsic rewards through evolution in a group of real mobile robots, the Cyber Rodents. Note that maximization of the average intrinsic reward alone is meaningless from the experimenter’s viewpoint, because a CR without constraints just wandered in the environment. By introducing constraints into policy improvement, the intrinsic reward becomes meaningful. In order to evaluate the efficiency of the evolved intrinsic rewards, several

References (22)

  • L.J. Eshelman et al. Real-coded genetic algorithms and interval-schemata
  • K. Främling. Guiding exploration by pre-existing knowledge without modifying reward. Neural Networks (2007)
  • R.A. Watson et al. Embodied evolution: Distributing an evolutionary algorithm in a population of robots. Robotics and Autonomous Systems (2002)
  • Barto, A.G., Singh, S., & Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collections of skills....
  • J. Baxter et al. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research (2001)
  • K. Doya et al. The Cyber Rodent Project: Exploration of adaptive mechanisms for self-preservation and self-reproduction. Adaptive Behavior (2005)
  • Elfwing, S. (2007). Embodied evolution of learning ability. Ph.D. thesis. Stockholm, Sweden: KTH School of Computer...
  • Elfwing, S., Uchibe, E., Doya, K., & Christensen, H.I. Darwinian embodied evolution of the learning ability for...
  • T. Kamioka et al. Max–min actor-critic for multiple reward reinforcement learning. IEICE Transactions on Information and Systems, J90-D (2007)
  • V.R. Konda et al. Actor-critic algorithms. SIAM Journal on Control and Optimization (2003)
  • Meeden, L.A., Marshall, J.B., & Blank, D. (2004). Self-motivated, task-independent reinforcement learning for robots....