
Toward a Mechanistic Account for Imitation Learning: An Analysis of Pendulum Swing-Up

  • Conference paper
New Frontiers in Artificial Intelligence (JSAI-isAI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10247)


Abstract

Learning an action from others requires inferring their underlying goals, and recent psychological studies have reported behavioral evidence that young children do infer others’ underlying goals by observing their actions. The goal of the present study is to propose a mechanistic account of how this goal inference is possible by observing others’ actions. For this purpose, we performed a series of simulations in which two agents control pendulums toward different goals, and analyzed which types of features make it possible to infer their different latent goals and control schemes. Our analysis showed that the pointwise dimension, a type of fractal dimension, of the pendulum movements is sufficiently informative to classify the types of agents. Given its invariant nature, this result suggests that fine-grained movement patterns such as the fractal dimension reflect the structure of the underlying control schemes and goals.


References

  1. Astrom, K.J., Furuta, K.: Swinging up a pendulum by energy control. Automatica 36(2), 287–295 (2000)

    Article  MathSciNet  Google Scholar 

  2. Bernstein, N.A.: Dexterity and Its Development. Psychology Press, Abingdon (1996)

    Google Scholar 

  3. Breazeal, C., Scassellati, B.: Robots that imitate humans. TRENDS Cogn. Sci. 6(11), 481–487 (2002)

    Article  Google Scholar 

  4. Cutler, C.D.: A review of the theory and estimation of fractal dimension. In: Tong, H. (ed.) Dimension Estimation and Models, pp. 1–107. World Scientific (1993)

    Google Scholar 

  5. Doya, K.: Reinforcement learning in continuous time and space. Neural Comput. 12, 243–269 (1999)

    Google Scholar 

  6. Grondman, I., Vaandrager, M., Busoniu, L., Babuska, R., Schuitema, E.: Efficient model learning methods for actor-critic control. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 42(3), 591–602 (2012)

    Article  Google Scholar 

  7. Hidaka, S., Kashyap, N.: On the estimation of pointwise dimension. arXiv:1312.2298 (2013)

  8. Kawato, M.: Computational Theory of Brain. Sangyo Tosho, Tokyo (1996). (in Japanese)

    Google Scholar 

  9. Marr, D.: Vision. MIT Press, Cambridge (1982)

    Google Scholar 

  10. Ng, A., Russell, S.J.: Algorithms for inverse reinforcement learning. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pp. 663–670 (2000)

    Google Scholar 

  11. Schaal, S.: Learning from demonstration. In: Mozer, M., Jordan, M., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9, pp. 1040–1046. MIT Press, Cambridge (1997)

    Google Scholar 

  12. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

    MATH  Google Scholar 

  13. Warneken, F., Tomasello, M.: Altruistic helping in human infants and young chimpanzees. Science 311, 1301–1303 (2006)

    Article  Google Scholar 

  14. Warneken, F., Tomasello, M.: The roots of human altruism. Br. J. Psychol. 100, 455–471 (2009)

    Article  Google Scholar 

  15. Young, L.S.: Dimension, entropy, and Lyapunov exponents. Ergodic Theory Dyn. Syst. 2(1), 109–124 (1982)

    Article  MathSciNet  Google Scholar 


Acknowledgment

This study was supported by the JSPS KAKENHI Grant-in-Aid for Young Scientists JP16H05860.

Author information


Correspondence to Takuma Torii.


A Reinforcement learning

Reinforcement learning [12] is a framework rooted in behavioral psychology and control theory. When the task environment is in state s, the learner takes an action a and receives a reward r from the environment in response to the action. The learner then encounters the environment in a new state \(s'\) drawn from a transition function \(Q(s'|s)\). The goal of learning is to acquire a control scheme g(a|s) that maximizes the cumulative reward.
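
This interaction can be summarized as a loop over states, actions, and rewards. The following is a minimal sketch in Python; the env and agent interfaces are hypothetical and only illustrate the framework, not the implementation used in the paper.

    # Minimal agent-environment interaction loop (hypothetical interfaces, for illustration only).
    def run_episode(env, agent, n_steps):
        s = env.reset()                     # initial state of the task environment
        total_reward = 0.0
        for _ in range(n_steps):
            a = agent.act(s)                # sample an action from the control scheme g(a|s)
            s_next, r = env.step(a)         # environment applies the transition Q(s'|s) and returns a reward
            agent.update(s, a, r, s_next)   # learner improves g(a|s) from this experience
            total_reward += r
            s = s_next
        return total_reward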

The pendulum swing-up task is a classic control problem with continuous space and time [5, 11]. Many studies have addressed this task (see [6] for recent developments). A simple and basic algorithm for this task is the so-called actor-critic architecture [6, 12], which is composed of two components: the actor and the critic. The actor represents the control scheme g(a|s), whereas the critic represents the value function V(s), which tells the learner the discounted expected reward of state s.
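
Under the discretization described below, both components reduce to tables indexed by the discrete state. A minimal sketch of these data structures in Python; the array names and initial values are assumptions, while the 40 x 40 shape follows the grid described in the next paragraph.

    import numpy as np

    N_THETA, N_THETA_DOT = 40, 40          # 40 x 40 grid over (theta, theta_dot)

    # Critic: one value estimate V(s) per discrete state.
    V = np.zeros((N_THETA, N_THETA_DOT))

    # Actor: a normal distribution N(mu_s, sigma_s) per discrete state.
    mu    = np.zeros((N_THETA, N_THETA_DOT))
    sigma = np.ones((N_THETA, N_THETA_DOT))

    def sample_control(s):
        """Sample a control input u from g(u|s) = N(mu_s, sigma_s)."""
        i, j = s
        return np.random.normal(mu[i, j], sigma[i, j])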

Since the task is in continuous space and time, it involves several engineering problems. The typical approach is discretization of the continuous space and time. For continuous time, we used discretized time steps for Euler integration (step size \(\text {d}t = 0.01\)) and sampled every 3 time steps. For continuous state space, we adopted a discretized representation (tile coding [12]) in which the continuous state space \((\theta , \dot{\theta }) \in [-\pi ,\pi ] \times [-2\pi ,2\pi ]\) is equally divided into a 40 \(\times \) 40 grid (each cell of which is here called a state s).
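
A sketch of this time and state discretization in Python; the damped-pendulum dynamics in euler_step (and its parameters) are assumptions made for illustration, since the equations of motion are not restated in this appendix.

    import numpy as np

    DT = 0.01                               # Euler integration step size

    def euler_step(theta, theta_dot, u, g=9.8, m=1.0, l=1.0, mu_f=0.01):
        """One Euler step of an assumed damped pendulum driven by torque u (theta = 0 is upright)."""
        theta_ddot = (-mu_f * theta_dot + m * g * l * np.sin(theta) + u) / (m * l ** 2)
        theta_dot  = theta_dot + DT * theta_ddot
        theta      = theta + DT * theta_dot
        theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap angle into [-pi, pi]
        return theta, theta_dot

    def discretize(theta, theta_dot, n=40):
        """Map (theta, theta_dot) in [-pi, pi] x [-2*pi, 2*pi] onto the n x n grid of states."""
        i = int((theta + np.pi) / (2 * np.pi) * n)
        j = int((np.clip(theta_dot, -2 * np.pi, 2 * np.pi) + 2 * np.pi) / (4 * np.pi) * n)
        return min(i, n - 1), min(j, n - 1)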

Learning proceeds by estimating state values, which in turn shapes g(a|s) to navigate toward more rewarding states. Suppose the pendulum is in state s (one cell of the grid) at time t. The learner supplies a control input u sampled from \(g(u|s) = N(\mu _s,\sigma _s)\), a normal distribution with mean \(\mu _s\) and variance \(\sigma _s^2\). In response, the learner observes the pendulum in a new state \(s'\) and receives a reward r. The learner then updates its value function V(s) for every s by

$$\begin{aligned} \varDelta V(s) = \alpha _c\,[r + \gamma _c\,V(s') - V(s)]\,E_c(s) \end{aligned}$$
(3)

where \(\alpha _c = 0.1\) is a learning rate, \(\gamma _c = 0.97\) is a discount rate, and \(E_c\) is an eligibility trace with exponential decay (decay rate \(\lambda _c = 0.65\)), a device for continuous tasks that assigns higher weights to recently visited states.
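
A minimal sketch of this critic update in Python, continuing the tabular V above; decaying the trace by \(\gamma _c \lambda _c\) is the standard TD(\(\lambda \)) choice and is an assumption about the exact implementation.

    ALPHA_C, GAMMA_C, LAMBDA_C = 0.1, 0.97, 0.65

    E_c = np.zeros_like(V)                  # eligibility trace for the critic

    def critic_update(s, r, s_next):
        """Update V for every state as in Eq. (3), using an exponentially decaying eligibility trace."""
        global V, E_c
        E_c *= GAMMA_C * LAMBDA_C           # decay the traces of previously visited states
        E_c[s] += 1.0                       # mark the current state as recently visited
        td_error = r + GAMMA_C * V[s_next] - V[s]
        V += ALPHA_C * td_error * E_c       # every state is updated in proportion to its trace
        return td_error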

For continuous control inputs, the control scheme \(g(\cdot )\) is expressed by a collection of normal distributions, one for each state s. The learner therefore has to determine the mean \(\mu \) and variance \(\sigma ^2\) for each state s. Formally, the learner modifies its control scheme parameters \(\mu _s\) and \(\sigma _s\) for every s by

$$\begin{aligned}&\varDelta \mu _s = \alpha _a\,[r + \gamma _a\,V(s') - V(s)]\,\frac{\partial N(\mu _s,\sigma _s)}{\partial \mu _s}(u)\,E_a(s) \end{aligned}$$
(4)
$$\begin{aligned}&\varDelta \sigma _s = \alpha _a\,[r + \gamma _a\,V(s') - V(s)]\,\frac{\partial N(\mu _s,\sigma _s)}{\partial \sigma _s}(u)\,E_a(s) \end{aligned}$$
(5)

where \(\alpha _a = 0.001\) is a learning rate, \(\gamma _a = 0.65\) is a discount rate, and \(E_a\) is an eligibility trace with decay (given by \(\lambda _a = 0.0\)).
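
A sketch of this actor update in Python, continuing the tabular mu and sigma above; the closed-form derivatives of the normal density and the lower bound that keeps sigma positive are assumptions made for illustration.

    ALPHA_A, GAMMA_A, LAMBDA_A = 0.001, 0.65, 0.0

    E_a = np.zeros_like(V)                  # eligibility trace for the actor

    def actor_update(s, u, r, s_next):
        """Update mu_s and sigma_s along the gradient of N(mu_s, sigma_s) at u, as in Eqs. (4)-(5)."""
        global E_a
        E_a *= GAMMA_A * LAMBDA_A           # lambda_a = 0 means only the current state stays eligible
        E_a[s] += 1.0
        td_error = r + GAMMA_A * V[s_next] - V[s]
        m, sd = mu[s], sigma[s]
        pdf = np.exp(-0.5 * ((u - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        dN_dmu    = pdf * (u - m) / sd ** 2                    # partial derivative of N w.r.t. mu_s
        dN_dsigma = pdf * ((u - m) ** 2 / sd ** 3 - 1.0 / sd)  # partial derivative of N w.r.t. sigma_s
        mu[s]    += ALPHA_A * td_error * dN_dmu * E_a[s]
        sigma[s]  = max(sigma[s] + ALPHA_A * td_error * dN_dsigma * E_a[s], 1e-2)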

The reward function \(r = f(s, a)\) must be designed carefully. Following [5], we set the reward function to \(r = \cos (\theta )\) for this task; it depends only on the angle \(\theta \). Note that \(\cos (\theta )\) characterizes the goal of this task: the inverted position \(\theta = 0\) gives the highest reward \(\cos (0) = 1\), and the hanging-down position \(\theta = \pm \pi \) gives the lowest reward \(\cos (\pm \pi ) = -1\).
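
In code, this reward is a one-line function of the continuous angle:

    import math

    def reward(theta):
        """Reward from [5]: +1 at the inverted position (theta = 0), -1 at hanging down (theta = +/- pi)."""
        return math.cos(theta)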


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Torii, T., Hidaka, S. (2017). Toward a Mechanistic Account for Imitation Learning: An Analysis of Pendulum Swing-Up. In: Kurahashi, S., Ohta, Y., Arai, S., Satoh, K., Bekki, D. (eds) New Frontiers in Artificial Intelligence. JSAI-isAI 2016. Lecture Notes in Computer Science, vol. 10247. Springer, Cham. https://doi.org/10.1007/978-3-319-61572-1_22


  • DOI: https://doi.org/10.1007/978-3-319-61572-1_22


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61571-4

  • Online ISBN: 978-3-319-61572-1

