
Embodied scene description

Published in Autonomous Robots

Abstract

Embodiment is an important characteristic of all intelligent agents, yet existing scene description tasks mainly analyze images passively, separating the semantic understanding of a scene from the agent's interaction with its environment. In this work, we propose Embodied Scene Description, which exploits the agent's embodiment to find an optimal viewpoint in its environment for scene description tasks. A learning framework combining the paradigms of imitation learning and reinforcement learning is established to teach the agent to generate the corresponding sensorimotor activities. The proposed framework is evaluated on both the AI2Thor dataset and a real-world robotic platform across different scene description tasks, demonstrating the effectiveness and scalability of the method. In addition, a mobile application is developed to help visually impaired people better understand their surroundings.
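The core loop the abstract describes—moving the agent to a more informative viewpoint before describing the scene—can be illustrated with a minimal sketch. Everything below (the env, policy, and captioner interfaces, the action names, and the step budget) is a hypothetical illustration under stated assumptions, not the authors' implementation:

```python
# Hypothetical sketch of an embodied scene-description loop.
# The env/policy/captioner interfaces and action names are assumptions
# for illustration; the paper's actual architecture is not shown here.

ACTIONS = ["MoveAhead", "MoveBack", "RotateLeft", "RotateRight",
           "LookUp", "LookDown", "Stop"]


def describe_scene(env, policy, captioner, max_steps=20):
    """Move until the policy decides the current view is informative
    enough, then caption that view.

    Assumed interfaces:
      env.observe()     -> RGB image of the current first-person view
      env.step(action)  -> executes one motion primitive
      policy.act(image) -> picks the next action (in the paper's setting,
                           such a policy is trained with imitation
                           learning and reinforcement learning)
      captioner(image)  -> natural-language description of the image
    """
    trajectory = []
    for _ in range(max_steps):
        view = env.observe()
        action = policy.act(view)
        trajectory.append(action)
        if action == "Stop":  # the policy judges this viewpoint good enough
            break
        env.step(action)
    return captioner(env.observe()), trajectory
```

The abstract does not specify the demonstrations or reward used for training, so the sketch leaves the policy as an opaque component.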



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 62025304, and in part by the Seed Fund of Tsinghua University (Department of Computer Science and Technology)-Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things.

Funding

The research leading to these results received funding from the National Natural Science Foundation of China under Grant 62025304.

Author information


Corresponding author

Correspondence to Huaping Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Informed consent

Informed consent was obtained from all individual participants included in this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is one of several papers published in Autonomous Robots comprising the Special Issue on Robotics: Science and Systems 2020.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 85614 KB)

Appendix for figures

Fig. 14 Dense Captioning task result for the scene Living room. The trajectories of the agent are shown as red curves in the left panel. (Color figure online)

Fig. 15 Dense Captioning task result for the scene Bedroom. The trajectories of the agent are shown as red curves in the left panel. (Color figure online)

Fig. 16 Dense Captioning task result for the scene Bathroom. The trajectories of the agent are shown as red curves in the left panel. (Color figure online)

Fig. 17 Failure case. When applying the proposed framework to the Dense Captioning task, the agent gets stuck at a slightly better local viewpoint, ignoring globally better viewpoints that would give a full view of the room

Fig. 18 LEFT: The developed robotic platform. A Kinect is mounted on top, and its position is carefully adjusted so that the optical center of the Kinect camera is aligned with the center of the platform. RIGHT: Two representative real working scenes for the agent. For each scene, we show the third-person view (left) and the first-person view (right)

Fig. 19 Real Scene 1: The robot turns around from the wall corner to the bed and finally correctly recognizes the bedroom scene

Fig. 20 Real Scene 2: The robot adjusts its position to observe the room, but the caption model mistakenly recognizes the scene as a bathroom

Fig. 21 Image Paragraphing task result for the scene Kitchen. The trajectories of the agent are shown as red curves in the left panel

Fig. 22 Image Paragraphing task result for the scene Living Room. The trajectories of the agent are shown as red curves in the left panel

Fig. 23 Image Paragraphing task result for the scene Bedroom. The trajectories of the agent are shown as red curves in the left panel

Fig. 24 Image Paragraphing task result for the scene Bathroom. The trajectories of the agent are shown as red curves in the left panel

Fig. 25 Failure case when applying the proposed framework to Image Paragraphing models. Although an object-rich scene is eventually discovered, the adopted paragraphing model does not describe it well

Fig. 26 The proposed model is deployed in a mobile application for human–computer interaction

Fig. 27 The user first faces the window and then turns around, eventually capturing the conference room scene

Fig. 28 The user first faces the wall, then turns around towards the bed, and finally correctly recognizes the bedroom scene

The manuscript contains many figures, some of which are very large and may occupy several pages. For a better reading experience, we have moved all of them to this final section at the end of the paper (Figs. 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28).


About this article


Cite this article

Tan, S., Guo, D., Liu, H. et al. Embodied scene description. Auton Robot 46, 21–43 (2022). https://doi.org/10.1007/s10514-021-10014-9


Keywords

Navigation