Abstract
Semantic segmentation is one of the most important tasks in computer vision and a main step towards scene understanding. With the advent of RGB-Depth sensors such as the Microsoft Kinect, RGB-Depth images have become easily available, which has changed the landscape of tasks such as semantic segmentation. Because depth images are independent of illumination, combining depth with RGB images can improve the quality of semantic labeling. The related research falls into two main categories: methods based on hand-crafted features and methods based on deep learning. Although state-of-the-art results are mainly achieved by deep learning methods, traditional methods were at the center of attention for some years, and much valuable work has been done in that category. Because the field of semantic segmentation is very broad, this survey concentrates on RGB-Depth semantic segmentation and presents a comprehensive analysis of its methods, their challenges and contributions, the available RGB-Depth datasets, evaluation metrics, state-of-the-art results, and promising directions for the field.




Cite this article
Fooladgar, F., Kasaei, S. A survey on indoor RGB-D semantic segmentation: from hand-crafted features to deep convolutional neural networks. Multimed Tools Appl 79, 4499–4524 (2020). https://doi.org/10.1007/s11042-019-7684-3
DOI: https://doi.org/10.1007/s11042-019-7684-3