Abstract
In this work, we present a new AI task, Vision to Action (V2A), in which an agent (a robotic arm) is asked to perform a high-level manipulation task (e.g. stacking) on objects present in a scene. The agent must propose a plan composed of primitive actions (e.g. simple movement, grasping) that successfully completes the given task. Queries are formulated so that the agent must perform visual reasoning over the presented scene before inferring the actions. We propose a novel approach based on multimodal attention for this task and demonstrate its performance on our new V2A dataset. We also describe a method for building the V2A dataset by generating task instructions for each scene, together with an engine capable of assessing whether a sequence of primitives leads to successful task completion.
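To make the task formulation concrete, the sketch below shows one way a plan of primitives and a success-checking engine could look. This is an illustrative toy model only: the primitive names (`move_to`, `grasp`, `release`), the scene representation, and the engine logic are hypothetical stand-ins, not the actual V2A action set or evaluation engine described in the paper.

```python
# Hypothetical sketch: a plan is a sequence of (primitive, argument) pairs,
# and a toy engine replays it to decide whether the high-level task succeeded.

def stack_engine(plan, scene):
    """Toy engine: succeeds if the plan ends with 'red_cube' resting on 'blue_cube'."""
    held = None
    positions = dict(scene)  # object -> what it currently rests on
    for action, arg in plan:
        if action == "move_to":
            pass  # simple movement; no state change in this toy model
        elif action == "grasp":
            if held is not None:
                return False  # gripper already occupied
            held = arg
        elif action == "release":
            if held is None:
                return False  # nothing to release
            positions[held] = arg  # place the held object onto the target
            held = None
    return positions.get("red_cube") == "blue_cube"

plan = [("move_to", "red_cube"), ("grasp", "red_cube"),
        ("move_to", "blue_cube"), ("release", "blue_cube")]
scene = {"red_cube": "table", "blue_cube": "table"}
print(stack_engine(plan, scene))  # True: this sequence completes the stacking task
```

An engine of this kind makes the dataset self-verifying: any candidate sequence of primitives can be scored automatically, without human annotation of each plan.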
Acknowledgements
This research was supported by UK EPSRC IPALM project EP/S032398/1.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Nazarczuk, M., Mikolajczyk, K. (2021). V2A - Vision to Action: Learning Robotic Arm Actions Based on Vision and Language. In: Ishikawa, H., Liu, CL., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science(), vol 12624. Springer, Cham. https://doi.org/10.1007/978-3-030-69535-4_44
Print ISBN: 978-3-030-69534-7
Online ISBN: 978-3-030-69535-4