Abstract
In this work, we present a new AI task, Vision to Action (V2A), in which an agent (a robotic arm) is asked to perform a high-level manipulation task (e.g. stacking) on objects present in a scene. The agent must propose a plan composed of primitive actions (e.g. simple movement, grasping) that successfully completes the given task. Queries are formulated so that the agent must perform visual reasoning over the presented scene before inferring the actions. We propose a novel approach based on multimodal attention for this task and demonstrate its performance on our new V2A dataset. We also describe a method for building the V2A dataset by generating task instructions for each scene, together with an engine capable of assessing whether a sequence of primitives leads to successful task completion.
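To make the task formulation concrete, the sketch below shows one way a plan of primitives and a success-checking engine could look. This is an illustrative toy model only: the primitive names (`move_to`, `grasp`, `release`), the scene representation, and the engine logic are hypothetical stand-ins, not the actual V2A action set or evaluation engine described in the paper.

```python
# Hypothetical sketch: a plan is a sequence of (primitive, argument) pairs,
# and a toy engine replays it to decide whether the high-level task succeeded.

def stack_engine(plan, scene):
    """Toy engine: succeeds if the plan ends with 'red_cube' resting on 'blue_cube'."""
    held = None
    positions = dict(scene)  # object -> what it currently rests on
    for action, arg in plan:
        if action == "move_to":
            pass  # simple movement; no state change in this toy model
        elif action == "grasp":
            if held is not None:
                return False  # gripper already occupied
            held = arg
        elif action == "release":
            if held is None:
                return False  # nothing to release
            positions[held] = arg  # place the held object onto the target
            held = None
    return positions.get("red_cube") == "blue_cube"

plan = [("move_to", "red_cube"), ("grasp", "red_cube"),
        ("move_to", "blue_cube"), ("release", "blue_cube")]
scene = {"red_cube": "table", "blue_cube": "table"}
print(stack_engine(plan, scene))  # True: this sequence completes the stacking task
```

An engine of this kind makes the dataset self-verifying: any candidate sequence of primitives can be scored automatically, without human annotation of each plan.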
Acknowledgements
This research was supported by UK EPSRC IPALM project EP/S032398/1.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Nazarczuk, M., Mikolajczyk, K. (2021). V2A - Vision to Action: Learning Robotic Arm Actions Based on Vision and Language. In: Ishikawa, H., Liu, CL., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science(), vol 12624. Springer, Cham. https://doi.org/10.1007/978-3-030-69535-4_44
Print ISBN: 978-3-030-69534-7
Online ISBN: 978-3-030-69535-4