Skip to main content
Log in

A survey on indoor RGB-D semantic segmentation: from hand-crafted features to deep convolutional neural networks

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript

Abstract

Semantic segmentation is one of the most important tasks in the field of computer vision. It is the main step towards scene understanding. With the advent of RGB-Depth sensors, such as Microsoft Kinect, nowadays RGB-Depth images are easily available. This has changed the landscape of some tasks such as semantic segmentation. As the depth images are independent of illumination, the combination of depth and RGB images can improve the quality of semantic labeling. The related research has been divided into two main categories, based on the usage of hand-crafted features and deep learning. Although the state-of-the-art results are mainly achieved by deep learning methods, traditional methods have also been at the center of attention for some years and lots of valuable work have been done in that category. As the field of semantic segmentation is very broad, in this survey, a comprehensive analysis has been carried out on RGB-Depth semantic segmentation methods, their challenges and contributions, available RGB-Depth datasets, metrics of evaluation, state-of-the-art results, and promising directions of the field.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4

Similar content being viewed by others

References

  1. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S et al (2012) Slic superpixels compared to state-of-the-art superpixel methods. IEEE Trans Pattern Anal Mach Intell 34(11):2274–2282

    Article  Google Scholar 

  2. Arbelaez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image segmentation. IEEE Trans Pattern Anal Mach Intell 33(5):898–916

    Article  Google Scholar 

  3. Armeni I, Sax S, Zamir AR, Savarese S (2017) Joint 2d-3d-semantic data for indoor scene understanding. arXiv:1702.01105

  4. Badrinarayanan V, Handa A, Cipolla R (2015) Segnet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv:1505.07293

  5. Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495

    Article  Google Scholar 

  6. Banica D, Sminchisescu C (2015) Second-order constrained parametric proposals and sequential search-based structured prediction for semantic segmentation in rgb-d images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3517–3526

  7. Bo L, Ren X, Fox D (2010) Kernel descriptors for visual recognition. In: Advances in neural information processing systems, pp 244–252

  8. Cadena C, Košecka J (2013) Semantic parsing for priming object detection in rgb-d scenes. In: 3rd workshop on semantic perception, mapping and exploration. Citeseer

  9. Chang A, Dai A, Funkhouser T, Halber M, Nießner M, Savva M, Song S, Zeng A, Zhang Y (2017) Matterport3d: learning from rgb-d data in indoor environments. arXiv:1709.06158

  10. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In: International conference on learning representations

  11. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848

    Article  Google Scholar 

  12. Cheng Y, Cai R, Li Z, Zhao X, Huang K (2017) Localitysensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), vol 3

  13. Couprie C, Farabet C, Najman L, LeCun Y (2013) Indoor semantic segmentation using depth information. arXiv:1301.3572

  14. Csurka G, Larlus D, Perronnin F, Meylan F (2013) What is a good evaluation measure for semantic segmentation?. In: BMVC, vol 27. Citeseer

  15. Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE international conference on computer vision, pp 2650–2658

  16. Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 35(8):1915–1929

    Article  Google Scholar 

  17. Firman M (2016) Rgbd datasets: past, present and future. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 19–31

  18. Fooladgar F, Kasaei S (2015) Learning strengths and weaknesses of classifiers for rgb-d semantic segmentation. In: 2015 9th Iranian conference on machine vision and image processing (MVIP). IEEE, pp 176–179

  19. Fooladgar F, Kasaei S (2015) Semantic segmentation of rgb-d images using 3d and local neighbouring features. In: 2015 International conference on digital image computing: techniques and applications (DICTA). IEEE, pp 1–7

  20. Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Garcia-Rodriguez J (2017) A review on deep learning techniques applied to semantic segmentation. arXiv:1704.06857

  21. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587

  22. Gupta S, Arbelaez P, Malik J (2013) Perceptual organization and recognition of indoor scenes from rgb-d images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 564–571

  23. Gupta S, Girshick R, Arbeláez P, Malik J (2014) Learning rich features from rgb-d images for object detection and segmentation. In: European conference on computer vision. Springer, pp 345–360

  24. Han J, Shao L, Xu D, Shotton J (2013) Enhanced computer vision with microsoft kinect sensor: a review. IEEE Trans Cybern 43(5):1318–1334

    Article  Google Scholar 

  25. Hariharan B, Arbeláez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 447–456

  26. Hazirbas C, Ma L, Domokos C, Cremers D (2016) Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture. In: Asian conference on computer vision. Springer, pp 213–228

  27. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  28. Hermans A, Floros G, Leibe B (2014) Dense 3d semantic mapping of indoor scenes from rgb-d images. In: 2014 IEEE international conference on robotics and automation (ICRA). IEEE, pp 2631–2638

  29. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: CVPR, vol 1, p 3

  30. Jégou S, Drozdzal M, Vazquez D, Romero A, Bengio Y (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In: 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 1175–1183

  31. Kendall A, Badrinarayanan V, Cipolla R (2015) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv:1511.02680

  32. Kindermann R (1980) Markov random fields and their applications. American Mathematical Society, Providence

    Book  Google Scholar 

  33. Kong S, Fowlkes C (2018) Pixel-wise attentional gating for parsimonious pixel labeling. arXiv:1805.01556

  34. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105

  35. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

    Article  Google Scholar 

  36. Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L (2016) Lstm-cf: unifying context modeling and fusion with lstms for rgb-d scene labeling. In: European conference on computer vision. Springer, pp 541–557

  37. Lin D, Chen G, Cohen-Or D, Heng PA, Huang H (2017) Cascaded feature network for semantic segmentation of rgb-d images. In: 2017 IEEE international conference on computer vision (ICCV). IEEE, pp 1320–1328

  38. Lin G, Milan A, Shen C, Reid I (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In: Cvpr, vol 1, p 5

  39. Lin G, Shen C, Van Den Hengel A, Reid I (2018) Exploring context with deep structured models for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 40(6):1352–1366

    Article  Google Scholar 

  40. Liu W, Rabinovich A, Berg AC (2015) Parsenet: looking wider to see better. arXiv:1506.04579

  41. Liu H, Wu W, Wang X, Qian Y (2018) RGB-D joint modelling with scene geometric information for indoor semantic segmentation. Multimed Tools Appl 77(17):22475–22488

    Article  Google Scholar 

  42. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

  43. Long J, Shelhamer E, Darrell T (2017) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651

    Article  Google Scholar 

  44. McCormac J, Handa A, Leutenegger S, Davison AJ (2017) Scenenet rgb-d: can 5m synthetic images beat generic imagenet pre-training on indoor segmentation. In: Proceedings of the international conference on computer vision (ICCV), vol 4

  45. Müller AC, Behnke S (2014) Learning depth-sensitive conditional random fields for semantic segmentation of rgb-d images. In: 2014 IEEE international conference on robotics and automation (ICRA). Citeseer, pp 6232–6237

  46. Naseer M, Khan SH, Porikli F (2018) Indoor scene understanding in 2.5/3d: a survey. arXiv:1803.03352

  47. Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, pp 1520–1528

  48. Park SJ, Hong KS, Lee S (2017) Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation. In: The IEEE international conference on computer vision (ICCV)

  49. Qi X, Liao R, Jia J, Fidler S, Urtasun R (2017) 3d graph neural networks for rgbd semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5199–5208

  50. Ren X, Bo L, Fox D (2012) Rgb-(d) scene labeling: features and algorithms. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 2759–2766

  51. Reynolds J, Murphy K (2007) Figure-ground segmentation using a hierarchical conditional random field. In: Fourth Canadian conference on computer and robot vision, 2007. CRV’07. IEEE, pp 175–182

  52. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 234–241

  53. Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. In: Advances in neural information processing systems, pp 3856–3866

  54. Seyedhosseini M, Tasdizen T (2016) Semantic image segmentation with contextual hierarchical models. IEEE Trans Pattern Anal Mach Intell 38(5):951

    Article  Google Scholar 

  55. Shotton J, Winn J, Rother C, Criminisi A (2006) Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: European conference on computer vision. Springer, pp 1–15

  56. Silberman N, Fergus R (2011) Indoor scene segmentation using a structured light sensor. In: 2011 IEEE international conference on computer vision workshops (ICCV Workshops). IEEE, pp 601–608

  57. Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images. In: European conference on computer vision. Springer, pp 746–760

  58. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  59. Song X, Zhong F, Wang Y, Qin X (2014) Estimation of kinect depth confidence through self-training. Vis Comput 30(6-8):855–865

    Article  Google Scholar 

  60. Song S, Lichtenberg SP, Xiao J (2015) Sun rgb-d: a rgb-d scene understanding benchmark suite. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 567–576

  61. Song X, Zheng J, Zhong F, Qin X (2018) Modeling deviations of rgb-d cameras for accurate depth map and color image registration. Multimed Tools Appl, 1–27

  62. Stückler J., Waldvogel B, Schulz H, Behnke S (2015) Dense real-time mapping of object-class semantics from rgb-d video. J Real-Time Image Proc 10(4):599–609

    Article  Google Scholar 

  63. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

  64. Wang W, Neumann U (2018) Depth-aware cnn for rgb-d segmentation. arXiv:1803.06791

  65. Wang J, Wang Z, Tao D, See S, Wang G (2016) Learning common and specific features for rgb-d semantic segmentation with deconvolutional networks. In: European conference on computer vision. Springer, pp 664–679

  66. Yang MY, Forstner W (2011) A hierarchical conditional random field model for labeling and classifying images of man-made scenes. In: 2011 IEEE international conference on computer vision workshops (ICCV Workshops). IEEE, pp 196–203

  67. Zhang D, Islam MM, Lu G (2012) A review on automatic image annotation techniques. Pattern Recogn 45(1):346–362

    Article  Google Scholar 

  68. Zheng S, Cheng MM, Warrell J, Sturgess P, Vineet V, Rother C, Torr PH (2014) Dense semantic image segmentation with objects and attributes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3214–3221

  69. Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z, Du D, Huang C, Torr PH (2015) Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE international conference on computer vision, pp 1529–1537

  70. Zhu H, Meng F, Cai J, Lu S (2016) Beyond pixels: a comprehensive survey from bottom-up to semantic image segmentation and cosegmentation. J Vis Commun Image Represent 34:12–27

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shohreh Kasaei.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Fooladgar, F., Kasaei, S. A survey on indoor RGB-D semantic segmentation: from hand-crafted features to deep convolutional neural networks. Multimed Tools Appl 79, 4499–4524 (2020). https://doi.org/10.1007/s11042-019-7684-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-019-7684-3

Keywords

Navigation