
Scene understanding using natural language description based on 3D semantic graph map

  • Original Research Paper
  • Published in: Intelligent Service Robotics

Abstract

A natural language description of the working environment is an important component of human–robot communication. Although 3D semantic graph mapping is widely studied for perceiving the environment, such approaches rarely address communication issues such as generating natural language descriptions from a semantic graph map. In computer vision, much research addresses workspace understanding over images by automatically generating sentences, but these methods typically exploit neither multiple scenes nor 3D information. In this paper, we introduce a novel natural language description method that uses a 3D semantic graph map. An object-oriented semantic graph map is first constructed from 3D information. A graph convolutional neural network and a recurrent neural network then generate a description of the map. The result is a natural language sentence, grounded in the objects of the 3D semantic graph map, that can describe either a single scene or multiple scenes. We validate the proposed method on a publicly available dataset and compare it with conventional methods.
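The pipeline the abstract describes — a graph convolutional encoder over object nodes of the semantic map, pooled into a graph vector that conditions a recurrent decoder — can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: the toy graph, feature sizes, vocabulary, and random (untrained) weights are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(A, H, W):
    # One graph-convolution layer (Kipf & Welling style):
    # add self-loops, symmetrically normalize A, then linear + ReLU.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy object-oriented semantic graph: 3 object nodes (e.g. table,
# chair, monitor) with 4-dim features; edges encode spatial relations.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.standard_normal((3, 4))        # per-object node features
W1 = rng.standard_normal((4, 8))       # layer weights (untrained)

node_emb = gcn_layer(A, H, W1)         # (3, 8) node embeddings
graph_emb = node_emb.mean(axis=0)      # mean-pool to one graph vector

# Greedy RNN-style decoder stub: at each step, update the hidden
# state and emit the argmax token from a tiny hypothetical vocabulary.
vocab = ["<eos>", "a", "table", "chair", "is", "near"]
W_h = rng.standard_normal((8, 8)) * 0.1
W_out = rng.standard_normal((8, len(vocab)))

h = graph_emb
sentence = []
for _ in range(6):
    h = np.tanh(W_h @ h)
    tok = vocab[int(np.argmax(h @ W_out))]
    if tok == "<eos>":
        break
    sentence.append(tok)
print(" ".join(sentence))
```

With trained weights, the decoder's hidden state would additionally receive the previously emitted word at each step; it is omitted here to keep the sketch short.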


Fig. 1 – Fig. 5



Acknowledgements

This work was funded by the National Research Foundation of Korea (Grant No. 2017R1A2B2002608).

Author information


Corresponding author

Correspondence to Jiyoun Moon.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Moon, J., Lee, B. Scene understanding using natural language description based on 3D semantic graph map. Intel Serv Robotics 11, 347–354 (2018). https://doi.org/10.1007/s11370-018-0257-x

