Abstract
Indoor semantic segmentation plays a critical role in many applications, such as intelligent robotics. However, multi-class recognition remains challenging, especially for pixel-level indoor semantic labeling. In this paper, we propose a novel deep structured model that combines the strengths of the widely used convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We first present a multi-information fusion model that uses scene category information to fine-tune a fully convolutional network. Then, to refine the coarse CNN outputs, an RNN is applied on top of the final CNN layer, yielding an end-to-end trainable system. This Graph-RNN is derived from a conditional random field defined on a superpixel-based graphical model, which can exploit flexible contextual information from different neighboring regions. Experimental results on the recent large-scale SUN RGB-D dataset demonstrate that the proposed model outperforms existing state-of-the-art methods on the challenging task with 40 dominant classes (\(40.8\%\) mean IU accuracy and \(69.1\%\) pixel accuracy). We also evaluate our model on the public NYU Depth V2 dataset and achieve remarkable performance.
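As a rough illustration of the refinement step described in the abstract, the following minimal NumPy sketch runs a mean-field-style recurrent update over a superpixel adjacency graph, in the spirit of CRF-as-RNN refinement of coarse FCN scores. It is not the authors' implementation: the unary scores, adjacency weights, compatibility matrix, and the function graph_rnn_refine are illustrative placeholders chosen for this sketch only.

# Illustrative sketch of CRF-style recurrent refinement over a superpixel graph.
# Not the paper's implementation; all inputs below are synthetic placeholders.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_rnn_refine(unary, adjacency, compat, n_steps=5):
    """Iteratively refine per-superpixel class scores.

    unary     : (N, C) coarse class scores per superpixel (e.g. pooled FCN outputs)
    adjacency : (N, N) symmetric weights between neighboring superpixels
    compat    : (C, C) label-compatibility matrix (penalizes unlikely label pairs)
    """
    q = softmax(unary)                  # initial marginals from the unary term
    for _ in range(n_steps):            # each step acts like one RNN time step
        msg = adjacency @ q             # aggregate neighboring superpixels' beliefs
        pairwise = msg @ compat         # apply label compatibility
        q = softmax(unary - pairwise)   # combine with unary term and renormalize
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, C = 6, 4                          # 6 superpixels, 4 classes (toy sizes)
    unary = rng.normal(size=(N, C))
    A = rng.random((N, N)); A = (A + A.T) / 2; np.fill_diagonal(A, 0.0)
    compat = 1.0 - np.eye(C)             # Potts-style compatibility
    print(graph_rnn_refine(unary, A, compat).round(3))

In the paper, the analogous recurrent update is learned end-to-end together with the CNN; the sketch only conveys the structure of refining coarse predictions with neighborhood context on a superpixel graph.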
Acknowledgements
The work described in this paper was supported by the National Science Foundation of China under Research Project Grant Nos. 61573048 and 61620106012, the International Scientific and Technological Cooperation Projects of China under Grant No. 2015DFG12650, and the Key Laboratory of Robotics and Intelligent Manufacturing Equipment Technology of Zhejiang Province.
Cite this article
Zheng, C., Wang, J., Chen, W. et al. Multi-class indoor semantic segmentation with deep structured model. Vis Comput 34, 735–747 (2018). https://doi.org/10.1007/s00371-017-1411-8