
Scene understanding using natural language description based on 3D semantic graph map

  • Original Research Paper
  • Published in: Intelligent Service Robotics

Abstract

A natural language description of the working environment is an important component of human–robot communication. Although 3D semantic graph mapping is widely studied for perceiving the environment, such approaches rarely address communication issues such as generating natural language descriptions from a semantic graph map. In computer vision, much research addresses workspace understanding over images by automatically generating sentences, but these methods typically exploit neither multiple scenes nor 3D information. In this paper, we introduce a novel natural language description method that uses a 3D semantic graph map. An object-oriented semantic graph map is first constructed from 3D information. A graph convolutional neural network and a recurrent neural network then generate a description of the map. The result is a natural language sentence, grounded in the objects of the 3D semantic graph map, that can describe either a single scene or multiple scenes. We validate the proposed method on a publicly available dataset and compare it with conventional methods.
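The pipeline the abstract describes — a graph convolutional encoder over object nodes of the semantic map, pooled into a graph vector that conditions a recurrent decoder — can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: the toy graph, feature sizes, vocabulary, and random (untrained) weights are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(A, H, W):
    # One graph-convolution layer (Kipf & Welling style):
    # add self-loops, symmetrically normalize A, then linear + ReLU.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy object-oriented semantic graph: 3 object nodes (e.g. table,
# chair, monitor) with 4-dim features; edges encode spatial relations.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.standard_normal((3, 4))        # per-object node features
W1 = rng.standard_normal((4, 8))       # layer weights (untrained)

node_emb = gcn_layer(A, H, W1)         # (3, 8) node embeddings
graph_emb = node_emb.mean(axis=0)      # mean-pool to one graph vector

# Greedy RNN-style decoder stub: at each step, update the hidden
# state and emit the argmax token from a tiny hypothetical vocabulary.
vocab = ["<eos>", "a", "table", "chair", "is", "near"]
W_h = rng.standard_normal((8, 8)) * 0.1
W_out = rng.standard_normal((8, len(vocab)))

h = graph_emb
sentence = []
for _ in range(6):
    h = np.tanh(W_h @ h)
    tok = vocab[int(np.argmax(h @ W_out))]
    if tok == "<eos>":
        break
    sentence.append(tok)
print(" ".join(sentence))
```

With trained weights, the decoder's hidden state would additionally receive the previously emitted word at each step; it is omitted here to keep the sketch short.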


Fig. 1 – Fig. 5



Acknowledgements

This work was funded by the National Research Foundation of Korea (Grant No. 2017R1A2B2002608).

Author information


Corresponding author

Correspondence to Jiyoun Moon.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Moon, J., Lee, B. Scene understanding using natural language description based on 3D semantic graph map. Intel Serv Robotics 11, 347–354 (2018). https://doi.org/10.1007/s11370-018-0257-x

