Model-Based Robot Imitation with Future Image Similarity

Wu, A.; Piergiovanni, A. J.; Ryoo, M. S.

doi:10.1007/s11263-019-01238-5

Published: 11 October 2019

Volume 128, pages 1360–1374, (2020)

International Journal of Computer Vision

A Correction to this article was published on 09 December 2019

This article has been updated

Abstract

We present a visual imitation learning framework that enables learning of robot action policies solely from expert samples, without any robot trials. Robot exploration and on-policy trials in a real-world environment can often be expensive or dangerous. We address this problem by learning a future scene prediction model solely from a collection of expert trajectories consisting of unlabeled example videos and actions, and by enabling action selection using future image similarity. In this approach, the robot learns to visually imagine the consequences of taking an action, and obtains the policy by evaluating how similar the predicted future image is to an expert sample. We develop an action-conditioned convolutional autoencoder, and show how we take advantage of future images for zero-online-trial imitation learning. We conduct experiments in simulated and real-life environments using a ground mobility robot reaching target objects, with and without obstacles. We explicitly compare our models to multiple baseline methods requiring only offline samples. The results confirm that our proposed methods outperform previous methods, including success rates \(1.5\times \) and \(2.5\times \) higher than behavioral cloning on two different tasks.

Change history

09 December 2019
The acknowledgement section was omitted in the original version of this article, which is given below.

References

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In International conference on machine learning (ICML).
Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483.
Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., & Levine, S. (2017). Stochastic variational video prediction. In CoRR. http://arxiv.org/abs/1710.11252.
Baram, N., Anschel, O., Caspi, I., & Mannor, S. (2017). End-to-end differentiable adversarial imitation learning. In International conference on machine learning (ICML) (pp. 390–399).
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. (2016). End to end learning for self-driving cars. arXiv:1604.07316.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In IEEE conference on computer vision and pattern recognition (CVPR).
Chao, Y. W., Yang, J., Price, B., Cohen, S., & Deng, J. (2016). Forecasting human dynamics from static images. In IEEE conference on computer vision and pattern recognition (CVPR).
Chiappa, S., Racanière, S., Wierstra, D., & Mohamed, S. (2017). Recurrent environment simulators. In CoRR. http://arxiv.org/abs/1704.02254.
Denton, E., & Fergus, R. (2018). Stochastic video generation with a learned prior. In CoRR. arXiv:1802.07687.
Dosovitskiy, A., Springenberg, J. T., Tatarchenko, M., & Brox, T. (2017). Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 692–705.
Finn, C., & Levine, S. (2017). Deep visual foresight for planning robot motion. In IEEE international conference on robotics and automation (ICRA). IEEE (pp. 2786–2793).
Finn, C., Goodfellow, I. J., & Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In CoRR. http://arxiv.org/abs/1605.07157
Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. arXiv:1603.00448.
Giusti, A., Guzzi, J., Cireşan, D. C., He, F.-L., Rodríguez, J. P., Fontana, F., et al. (2016). A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2), 661–667.
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In Advances in neural information processing systems (NIPS).
Ho, J., Gupta, J., & Ermon, S. (2016). Model-free imitation learning with policy optimization. arXiv:1605.08478.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Laskey, M., Lee, J., Hsieh, W., Liaw, R., Mahler, J., Fox, R., & Goldberg, K. (2017). Iterative noise injection for scalable imitation learning. arXiv:1703.09327.
Lee, J., & Ryoo, M. S. (2017). Learning robot activities from first-person human videos using convolutional future regression. In IEEE/RSJ international conference on intelligent robots and systems (IROS).
Levine, S., Pastor, P., Krizhevsky, A., & Quillen, D. (2016). Learning hand-eye coordination for robotic grasping with large-scale data collection. In International symposium on experimental robotics (pp. 173–184). Springer.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European conference on computer vision (ECCV).
Liu, Y., Gupta, A., Abbeel, P., & Levine, S. (2018). Imitation from observation: learning to imitate behaviors from raw video via context translation. arXiv:1707.03374.
Liu, Z., Yeh, R. A., Tang, X., Liu, Y., & Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In IEEE international conference on computer vision (ICCV).
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
Ng, A. Y., & Jordan, M. I. (2000). Inverse reinforcement learning. In International conference on machine learning (ICML).
Oh, J., Guo, X., Lee, H., Lewis, R. L., & Singh, S. (2015). Action-conditional video prediction using deep networks in atari games. In CoRR. arXiv:1507.08750.
Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., & Darrell, T. (2018). Zero-shot visual imitation. arXiv:1804.08606.
Peng, X. B., Abbeel, P., Levine, S., & van de Panne, M. (2018). Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. In ACM SIGGRAPH.
Piergiovanni, A. J., & Ryoo, M. S. (2018). Learning latent super-events to detect multiple activities in videos. In IEEE conference on computer vision and pattern recognition (CVPR).
Pomerleau, D. A. (1989). Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems (NIPS) (pp. 305–313).
Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1), 88–97.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.
Ross, S., Gordon, G., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics (pp. 627–635).
Sadeghi, F., Toshev, A., Jang, E., & Levine, S. (2017). Sim2real view invariant visual servoing by recurrent control. arXiv:1712.07642.
Salvador, S., & Chan, P. (2004). Fastdtw: Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5), 561–580.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge: MIT Press.
Tatarchenko, M., Dosovitskiy, A., & Brox, T. (2016). Multi-view 3D models from single images with a convolutional network. In European conference on computer vision (ECCV).
Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. arXiv:1805.01954.
Vakanski, A., Mantegh, I., Irish, A., & Janabi-Sharifi, F. (2012). Trajectory learning for robot programming by demonstration using hidden markov model and dynamic time warping. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(4), 1039–1052.
Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Anticipating visual representations from unlabeled video. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 98–106).
Walker, J., Gupta, A., & Hebert, M. (2014). Patch to the future: Unsupervised visual prediction. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3302–3309).
Walker, J., Marino, K., Gupta, A., & Hebert, M. (2017). The pose knows: Video forecasting by generating pose futures. In IEEE international conference on computer vision (ICCV) (pp. 3352–3361).
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612.
Wulfmeier, M., Ondruska, P., & Posner, I. (2015). Deep inverse reinforcement learning. arXiv:1507.04888.
Zhou, T., Tulsiani, S., Sun, W., Malik, J., & Efros, A. A. (2016). View synthesis by appearance flow. In European conference on computer vision (ECCV) (pp. 286–301).
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., & Farhadi, A. (2016). Target-driven visual navigation in indoor scenes using deep reinforcement learning. arXiv:1609.05143.
Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI conference on artificial intelligence (AAAI).

Author information

Authors and Affiliations

Indiana University, Bloomington, IN, USA
A. Wu, A. J. Piergiovanni & M. S. Ryoo
Stony Brook University, Stony Brook, NY, USA
M. S. Ryoo

Corresponding author

Correspondence to A. Wu.

Additional information

Communicated by Anelia Angelova, Gustavo Carneiro, Niko Sünderhauf, Jürgen Leitner.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Alan Wu and A. J. Piergiovanni contributed equally to the paper.

Appendix

1.1 Implementation Details

We implemented the CNN models using the PyTorch library. The encoder/decoder networks follow the architecture of DCGAN (Radford et al. 2015), using their discriminator as our encoder CNN and their generator as our decoder CNN. Specifically, the encoder has 6 convolutional layers, each with a \(3\times 3\) kernel and a stride of 2. The layers have 64, 128, 256, 512, 512, 128 channels. Our input images are resized to \(64\times 64\), resulting in a feature map of size \(128\times 3\times 3\). For the linear-representation model shown in Fig. 6a, we reshape this feature map into a vector of size \(128\cdot 3\cdot 3\) and then use a fully-connected layer to map it to a 4096-dimensional vector. Our action network has two layers that increase the dimensionality to 64 and then 256.

In the convolutional-representation model used in Fig. 6b, we leave the representation as is. Our actions are 3-dimensional vectors for robot pose (\(x,y,\theta \)), which are used as input to the action network. The action network has two layers that produce a 576-dimensional vector, which we reshape to a spatial tensor of size \(64\times 3\times 3\). We concatenate this tensor with the convolutional representation along the channel axis, and the result is used as input to the decoder. The convolutional future prediction model contains 5 convolutional layers with a \(3\times 3\) kernel and a stride of 1. The layers contain 256, 512, 512, 256, 128 channels.
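
The shape bookkeeping in this paragraph can be checked with a minimal sketch. Plain nested lists stand in for tensors, and the helper names (`reshape_chw`, `concat_channels`, `shape`), the placeholder values, and the order of concatenation (image representation first, then action tensor) are illustrative assumptions, not the authors' PyTorch code.

```python
def reshape_chw(flat, channels, height, width):
    """Reshape a flat list into a channels x height x width nested list."""
    if len(flat) != channels * height * width:
        raise ValueError("size mismatch")
    it = iter(flat)
    return [[[next(it) for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


def concat_channels(a, b):
    """Concatenate two C x H x W nested lists along the channel axis."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("spatial sizes differ")
    return a + b


def shape(t):
    """Return (channels, height, width) of a nested list."""
    return (len(t), len(t[0]), len(t[0][0]))


# Placeholder 576-dimensional action-network output (values are arbitrary).
action_vec = [0.0] * 576
z_a = reshape_chw(action_vec, 64, 3, 3)

# Placeholder 128 x 3 x 3 convolutional image representation.
z_I = [[[0.0] * 3 for _ in range(3)] for _ in range(128)]

# Assumed order: image channels first, then action channels.
joint = concat_channels(z_I, z_a)
print(shape(z_a), shape(joint))  # (64, 3, 3) (192, 3, 3)
```

Under these assumptions the concatenated tensor has 128 + 64 = 192 channels on the same \(3\times 3\) grid.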

Our decoder contains 6 deconvolutional layers for upsampling. All have a \(3\times 3\) kernel and a stride of 2. In a deconvolutional layer, a stride of 2 doubles the output size. The layers contain 512, 512, 256, 128, 64, 3 channels. The last layer is followed by a \(\tanh \) activation function. All other layers in all networks are followed by batch normalization and use the LeakyReLU activation function with the negative slope set to 0.2. We minimize our loss function with gradient descent using the Adam solver (Kingma and Ba 2014) with the learning rate set to 0.001.
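
The doubling behaviour of a stride-2 deconvolutional layer can be illustrated with the standard output-size relation for transposed convolutions (dilation 1). The text does not state the padding settings, so `padding=1` and `output_padding=1` below are assumptions chosen so that a \(3\times 3\) kernel with stride 2 doubles the size exactly; the function name is hypothetical.

```python
def conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed convolution along one axis (dilation 1).

    padding and output_padding are assumed values, not taken from the paper.
    """
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Each stride-2 layer doubles the spatial size under the assumed padding.
for n in (1, 2, 4, 8):
    print(n, "->", conv_transpose_out(n))
```

With other padding choices the same kernel and stride give a size that is close to, but not exactly, twice the input.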

The LSTMs are implemented similarly to Denton and Fergus (2018). \(LSTM_{\phi }\) and \(LSTM_{\psi }\) are both single-layer LSTMs with 256 cells in each layer. Each network has a linear embedding layer and a fully connected output layer. At inference, the output of \(LSTM_{\psi }\) is concatenated to \(z_I\) and \(z_a\), and fed to the decoder. The output dimensionalities of the LSTM networks are \(g = 128\) and \(\mu _{\phi } = \mu _{\psi } = 64\).

1.2 Training Information

Our training curves for the image predictor model and the critic are shown in Fig. 15. For the image predictor on both datasets, we set the learning rate to 0.001 and the batch size to 60. The \(\beta \) multiplier for the KL loss was set to 0.0001 in our experiments. The learning rate of the value function was set to \(5\times 10^{-6}\). The weights of the image predictor were held constant when training the value function.
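
As a rough illustration of how a \(\beta \) multiplier of this size scales a KL term, the sketch below assumes an objective of the form reconstruction loss plus \(\beta \) times the KL divergence between two diagonal Gaussians, in the style of Denton and Fergus (2018). That form, the function names, and the example values are assumptions; only \(\beta = 0.0001\) and the 64-dimensional \(\mu \) come from the text.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total


def total_loss(recon, kl, beta=1e-4):
    """Assumed objective: reconstruction loss plus beta-weighted KL term."""
    return recon + beta * kl


# Hypothetical 64-dimensional posterior and prior with unit variance.
mu_q = [1.0] * 64
mu_p = [0.0] * 64
logvar = [0.0] * 64

kl = kl_diag_gaussians(mu_q, logvar, mu_p, logvar)
print(kl, total_loss(0.25, kl))
```

With \(\beta \) this small, the KL term acts as a light regularizer next to the reconstruction loss.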

About this article

Cite this article

Wu, A., Piergiovanni, A.J. & Ryoo, M.S. Model-Based Robot Imitation with Future Image Similarity. Int J Comput Vis 128, 1360–1374 (2020). https://doi.org/10.1007/s11263-019-01238-5

Received: 27 July 2018
Accepted: 18 September 2019
Published: 11 October 2019
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11263-019-01238-5
