
Toward a Mechanistic Account for Imitation Learning: An Analysis of Pendulum Swing-Up

  • Conference paper
New Frontiers in Artificial Intelligence (JSAI-isAI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10247)


Abstract

Learning an action from others requires inferring their underlying goals, and recent psychological studies have reported behavioral evidence that young children do infer others’ underlying goals by observing their actions. The goal of the present study is to propose a mechanistic account of how this goal inference is possible by observing others’ actions. For this purpose, we performed a series of simulations in which two agents control pendulums toward different goals, and analyzed which types of features make it possible to infer their different latent goals and control schemes. Our analysis showed that the pointwise dimension, a type of fractal dimension, of the pendulum movements is sufficiently informative to classify the types of agents. Given its invariant nature, this result suggests that fine-grained movement patterns such as the fractal dimension reflect the structure of the underlying control schemes and goals.


References

  1. Astrom, K.J., Furuta, K.: Swinging up a pendulum by energy control. Automatica 36(2), 287–295 (2000)

    Article  MathSciNet  Google Scholar 

  2. Bernstein, N.A.: Dexterity and Its Development. Psychology Press, Abingdon (1996)

    Google Scholar 

  3. Breazeal, C., Scassellati, B.: Robots that imitate humans. TRENDS Cogn. Sci. 6(11), 481–487 (2002)

    Article  Google Scholar 

  4. Cutler, C.D.: A review of the theory and estimation of fractal dimension. In: Tong, H. (ed.) Dimension Estimation and Models, pp. 1–107. World Scientific (1993)

    Google Scholar 

  5. Doya, K.: Reinforcement learning in continuous time and space. Neural Comput. 12, 243–269 (1999)

    Google Scholar 

  6. Grondman, I., Vaandrager, M., Busoniu, L., Babuska, R., Schuitema, E.: Efficient model learning methods for actor-critic control. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 42(3), 591–602 (2012)

    Article  Google Scholar 

  7. Hidaka, S., Kashyap, N.: On the estimation of pointwise dimension. arXiv:1312.2298 (2013)

  8. Kawato, M.: Computational Theory of Brain. Sangyo Tosho, Tokyo (1996). (in Japanese)

    Google Scholar 

  9. Marr, D.: Vision. MIT Press, Cambridge (1982)

    Google Scholar 

  10. Ng, A., Russell, S.J.: Algorithms for inverse reinforcement learning. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pp. 663–670 (2000)

    Google Scholar 

  11. Schaal, S.: Learning from demonstration. In: Mozer, M., Jordan, M., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9, pp. 1040–1046. MIT Press, Cambridge (1997)

    Google Scholar 

  12. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

    MATH  Google Scholar 

  13. Warneken, F., Tomasello, M.: Altruistic helping in human infants and young chimpanzees. Science 311, 1301–1303 (2006)

    Article  Google Scholar 

  14. Warneken, F., Tomasello, M.: The roots of human altruism. Br. J. Psychol. 100, 455–471 (2009)

    Article  Google Scholar 

  15. Young, L.S.: Dimension, entropy, and Lyapunov exponents. Ergodic Theory Dyn. Syst. 2(1), 109–124 (1982)

    Article  MathSciNet  Google Scholar 


Acknowledgment

This study was supported by the JSPS KAKENHI Grant-in-Aid for Young Scientists JP16H05860.

Author information


Correspondence to Takuma Torii.


A Reinforcement learning

Reinforcement learning [12] is a framework rooted in behavioral psychology and control theory. When the task environment is in state s, the learner takes an action a and receives a reward r from the environment in response to the action. The learner then encounters the environment in a new state \(s'\) drawn from a transition function \(Q(s'|s)\). The goal of learning is to acquire a control scheme g(a|s) that maximizes the cumulative reward.
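
This interaction can be summarized as a loop over states, actions, and rewards. The following is a minimal sketch in Python; the env and agent interfaces are hypothetical and only illustrate the framework, not the implementation used in the paper.

    # Minimal agent-environment interaction loop (hypothetical interfaces, for illustration only).
    def run_episode(env, agent, n_steps):
        s = env.reset()                     # initial state of the task environment
        total_reward = 0.0
        for _ in range(n_steps):
            a = agent.act(s)                # sample an action from the control scheme g(a|s)
            s_next, r = env.step(a)         # environment applies the transition Q(s'|s) and returns a reward
            agent.update(s, a, r, s_next)   # learner improves g(a|s) from this experience
            total_reward += r
            s = s_next
        return total_reward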

The pendulum swing-up task is a classic control problem with continuous space and time [5, 11]. Many studies have addressed this task (see [6] for recent developments). A simple and basic algorithm for this task is the so-called actor-critic architecture [6, 12], which is composed of two components: the actor and the critic. The actor represents the control scheme g(a|s), whereas the critic represents the value function V(s), which tells the learner the discounted expected reward of state s.
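
Under the discretization described below, both components reduce to tables indexed by the discrete state. A minimal sketch of these data structures in Python; the array names and initial values are assumptions, while the 40 x 40 shape follows the grid described in the next paragraph.

    import numpy as np

    N_THETA, N_THETA_DOT = 40, 40          # 40 x 40 grid over (theta, theta_dot)

    # Critic: one value estimate V(s) per discrete state.
    V = np.zeros((N_THETA, N_THETA_DOT))

    # Actor: a normal distribution N(mu_s, sigma_s) per discrete state.
    mu    = np.zeros((N_THETA, N_THETA_DOT))
    sigma = np.ones((N_THETA, N_THETA_DOT))

    def sample_control(s):
        """Sample a control input u from g(u|s) = N(mu_s, sigma_s)."""
        i, j = s
        return np.random.normal(mu[i, j], sigma[i, j])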

Since the task is in continuous space and time, it involves several engineering problems. The typical approach is discretization of the continuous space and time. For continuous time, we used discretized time steps for Euler integration (step size \(\text {d}t = 0.01\)) and sampled every 3 time steps. For continuous state space, we adopted a discretized representation (tile coding [12]) in which the continuous state space \((\theta , \dot{\theta }) \in [-\pi ,\pi ] \times [-2\pi ,2\pi ]\) is equally divided into a 40 \(\times \) 40 grid (each cell of which is here called a state s).
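
A sketch of this time and state discretization in Python; the damped-pendulum dynamics in euler_step (and its parameters) are assumptions made for illustration, since the equations of motion are not restated in this appendix.

    import numpy as np

    DT = 0.01                               # Euler integration step size

    def euler_step(theta, theta_dot, u, g=9.8, m=1.0, l=1.0, mu_f=0.01):
        """One Euler step of an assumed damped pendulum driven by torque u (theta = 0 is upright)."""
        theta_ddot = (-mu_f * theta_dot + m * g * l * np.sin(theta) + u) / (m * l ** 2)
        theta_dot  = theta_dot + DT * theta_ddot
        theta      = theta + DT * theta_dot
        theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap angle into [-pi, pi]
        return theta, theta_dot

    def discretize(theta, theta_dot, n=40):
        """Map (theta, theta_dot) in [-pi, pi] x [-2*pi, 2*pi] onto the n x n grid of states."""
        i = int((theta + np.pi) / (2 * np.pi) * n)
        j = int((np.clip(theta_dot, -2 * np.pi, 2 * np.pi) + 2 * np.pi) / (4 * np.pi) * n)
        return min(i, n - 1), min(j, n - 1)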

Learning proceeds by estimating state values, which in turn shapes g(a|s) to navigate toward more rewarding states. Suppose the pendulum is in state s (one cell of the grid) at time t. The learner supplies a control input u sampled from \(g(u|s) = N(\mu _s,\sigma _s)\), a normal distribution with mean \(\mu _s\) and variance \(\sigma _s^2\). In response, the learner observes the pendulum in a new state \(s'\) and receives a reward r. The learner then updates its value function V(s) for every s by

$$\begin{aligned} \varDelta V(s) = \alpha _c\,[r + \gamma _c\,V(s') - V(s)]\,E_c(s) \end{aligned}$$
(3)

where \(\alpha _c = 0.1\) is a learning rate, \(\gamma _c = 0.97\) is a discount rate, and \(E_c\) is an eligibility trace with exponential decay (decay rate \(\lambda _c = 0.65\)), a device for continuous tasks that assigns higher weights to recently visited states.
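
A minimal sketch of this critic update in Python, continuing the tabular V above; decaying the trace by \(\gamma _c \lambda _c\) is the standard TD(\(\lambda \)) choice and is an assumption about the exact implementation.

    ALPHA_C, GAMMA_C, LAMBDA_C = 0.1, 0.97, 0.65

    E_c = np.zeros_like(V)                  # eligibility trace for the critic

    def critic_update(s, r, s_next):
        """Update V for every state as in Eq. (3), using an exponentially decaying eligibility trace."""
        global V, E_c
        E_c *= GAMMA_C * LAMBDA_C           # decay the traces of previously visited states
        E_c[s] += 1.0                       # mark the current state as recently visited
        td_error = r + GAMMA_C * V[s_next] - V[s]
        V += ALPHA_C * td_error * E_c       # every state is updated in proportion to its trace
        return td_error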

For continuous control inputs, the control scheme \(g(\cdot )\) is expressed by a collection of normal distributions, one for each state s. The learner therefore has to determine the mean \(\mu \) and variance \(\sigma ^2\) for each state s. Formally, the learner modifies its control scheme parameters \(\mu _s\) and \(\sigma _s\) for every s by

$$\begin{aligned}&\varDelta \mu _s = \alpha _a\,[r + \gamma _a\,V(s') - V(s)]\,\frac{\partial N(\mu _s,\sigma _s)}{\partial \mu _s}(u)\,E_a(s) \end{aligned}$$
(4)
$$\begin{aligned}&\varDelta \sigma _s = \alpha _a\,[r + \gamma _a\,V(s') - V(s)]\,\frac{\partial N(\mu _s,\sigma _s)}{\partial \sigma _s}(u)\,E_a(s) \end{aligned}$$
(5)

where \(\alpha _a = 0.001\) is a learning rate, \(\gamma _a = 0.65\) is a discount rate, and \(E_a\) is an eligibility trace with decay (given by \(\lambda _a = 0.0\)).
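
A sketch of this actor update in Python, continuing the tabular mu and sigma above; the closed-form derivatives of the normal density and the lower bound that keeps sigma positive are assumptions made for illustration.

    ALPHA_A, GAMMA_A, LAMBDA_A = 0.001, 0.65, 0.0

    E_a = np.zeros_like(V)                  # eligibility trace for the actor

    def actor_update(s, u, r, s_next):
        """Update mu_s and sigma_s along the gradient of N(mu_s, sigma_s) at u, as in Eqs. (4)-(5)."""
        global E_a
        E_a *= GAMMA_A * LAMBDA_A           # lambda_a = 0 means only the current state stays eligible
        E_a[s] += 1.0
        td_error = r + GAMMA_A * V[s_next] - V[s]
        m, sd = mu[s], sigma[s]
        pdf = np.exp(-0.5 * ((u - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        dN_dmu    = pdf * (u - m) / sd ** 2                    # partial derivative of N w.r.t. mu_s
        dN_dsigma = pdf * ((u - m) ** 2 / sd ** 3 - 1.0 / sd)  # partial derivative of N w.r.t. sigma_s
        mu[s]    += ALPHA_A * td_error * dN_dmu * E_a[s]
        sigma[s]  = max(sigma[s] + ALPHA_A * td_error * dN_dsigma * E_a[s], 1e-2)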

The reward function \(r = f(s, a)\) must be designed carefully. Following [5], we set the reward function to \(r = \cos (\theta )\) for this task; it depends only on the angle \(\theta \). Note that \(\cos (\theta )\) characterizes the goal of this task: the inverted position \(\theta = 0\) gives the highest reward \(\cos (0) = 1\), and the hanging-down position \(\theta = \pm \pi \) gives the lowest reward \(\cos (\pm \pi ) = -1\).
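
In code, this reward is a one-line function of the continuous angle:

    import math

    def reward(theta):
        """Reward from [5]: +1 at the inverted position (theta = 0), -1 at hanging down (theta = +/- pi)."""
        return math.cos(theta)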


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Torii, T., Hidaka, S. (2017). Toward a Mechanistic Account for Imitation Learning: An Analysis of Pendulum Swing-Up. In: Kurahashi, S., Ohta, Y., Arai, S., Satoh, K., Bekki, D. (eds) New Frontiers in Artificial Intelligence. JSAI-isAI 2016. Lecture Notes in Computer Science, vol. 10247. Springer, Cham. https://doi.org/10.1007/978-3-319-61572-1_22


  • DOI: https://doi.org/10.1007/978-3-319-61572-1_22


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61571-4

  • Online ISBN: 978-3-319-61572-1

