Abstract
A natural language description of the working environment is an important component of human–robot communication. Although 3D semantic graph maps have been widely studied as perceptual representations of the environment, such approaches rarely address communication tasks such as generating natural language descriptions from a semantic graph map. In computer vision, many studies address workspace understanding over images and automatically generate sentences, but they typically exploit neither multiple scenes nor 3D information. In this paper, we introduce a novel natural language description method based on a 3D semantic graph map. An object-oriented semantic graph map is first constructed from 3D information. A graph convolutional neural network and a recurrent neural network are then used to generate a description of the map. The result is a natural language sentence focusing on the objects in the 3D semantic graph map, generated from either a single scene or multiple scenes. We validate the proposed method on a publicly available dataset and compare it with conventional methods.
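To make the encoding step concrete, the following is a minimal NumPy sketch of how a graph convolutional network can encode an object-oriented semantic graph map into a single vector that a recurrent decoder could then condition on. The graph size, feature dimensions, weights, and mean pooling here are illustrative assumptions, not the paper's actual architecture; the propagation rule is the standard one of Kipf and Welling.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer (Kipf & Welling propagation rule):
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy semantic graph map: 3 objects (e.g. table, chair, lamp); the
# adjacency matrix encodes spatial relations, rows of H are 4-D
# object features (e.g. class score and position).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.standard_normal((3, 4))
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 8))

# Two GCN layers propagate features between related objects; mean
# pooling over nodes yields one fixed-size graph embedding that
# would initialize the RNN sentence decoder.
graph_vec = gcn_layer(A, gcn_layer(A, H, W1), W2).mean(axis=0)
print(graph_vec.shape)  # (8,)
```

Because the graph embedding has a fixed size regardless of the number of objects or scenes pooled into the map, the same decoder can describe a single scene or an accumulated multi-scene map.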
Acknowledgements
Funding was provided by the National Research Foundation of Korea (Grant No. 2017R1A2B2002608).
Moon, J., Lee, B. Scene understanding using natural language description based on 3D semantic graph map. Intel Serv Robotics 11, 347–354 (2018). https://doi.org/10.1007/s11370-018-0257-x