
3D Guided Weakly Supervised Semantic Segmentation

  • Conference paper

Computer Vision – ACCV 2020 (ACCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12622)

Abstract

Pixel-wise clean annotation is necessary for fully-supervised semantic segmentation, but it is laborious and expensive to obtain. In this paper, we propose a weakly supervised 2D semantic segmentation model that combines sparse bounding box labels with available 3D information, which is much easier to obtain with advanced sensors. We introduce a 2D-3D inference module to generate accurate pixel-wise segment proposal masks. Guided by 3D information, we first generate a point cloud of objects and compute a per-class objectness probability score for each point using the projected bounding boxes. We then project the point cloud, together with its objectness probabilities, back onto the 2D images and apply a refinement step to obtain segment proposals, which are treated as pseudo labels to train a semantic segmentation network. Our method works in a recursive manner to gradually refine these segment proposals. We conduct extensive experiments on the 2D-3D-S dataset, for which we manually labeled a subset of images with bounding boxes. We show that the proposed method generates accurate segment proposals even when bounding box labels are available for only a small subset of training images. Comparison with recent state-of-the-art methods further demonstrates the effectiveness of our approach.
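
To make the 2D-3D inference step above concrete, the following sketch illustrates one way the per-point objectness scoring could be implemented: each 3D point is projected into every annotated view with a pinhole camera model, and its score for a class is the fraction of views in which the projection falls inside a bounding box of that class. This is a minimal illustration under our own assumptions; the function names, the per-view data layout, and the visibility normalisation are ours, and the paper's actual scoring and refinement steps may differ.

    import numpy as np

    def project_points(points_3d, K, R, t):
        # Project Nx3 world points into a pinhole camera: world -> camera
        # coordinates, apply intrinsics K, then the perspective divide.
        cam = points_3d @ R.T + t
        uv = cam @ K.T
        uv = uv[:, :2] / uv[:, 2:3]
        return uv, cam[:, 2]          # pixel coordinates and camera-frame depths

    def accumulate_objectness(points_3d, views, num_classes):
        # Hypothetical per-view layout: each view is a dict with intrinsics 'K',
        # rotation 'R', translation 't', 'image_size' (w, h), and 'boxes' given
        # as (class_id, x0, y0, x1, y1) tuples in pixel coordinates.
        scores = np.zeros((len(points_3d), num_classes))
        visible = np.full(len(points_3d), 1e-6)      # avoid division by zero
        for view in views:
            uv, depth = project_points(points_3d, view['K'], view['R'], view['t'])
            w, h = view['image_size']
            in_image = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                                   & (uv[:, 1] >= 0) & (uv[:, 1] < h)
            visible += in_image
            for cls, x0, y0, x1, y1 in view['boxes']:
                in_box = in_image & (uv[:, 0] >= x0) & (uv[:, 0] <= x1) \
                                  & (uv[:, 1] >= y0) & (uv[:, 1] <= y1)
                scores[in_box, cls] += 1.0
        # Normalise each point's class counts by how often it was visible.
        return scores / visible[:, None]

The resulting per-point class probabilities would then be projected back into each image and thresholded (for example with Otsu's method [52]) to form the segment-proposal pseudo labels used to train the segmentation network.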


Notes

  1. The 3D information provided by the 2D-3D-S dataset [8] is fixed, and SLAM reconstruction is a mature technology, so we assume high-quality 3D information is available. Our method is not based on SLAM, and every point is projected independently, so we do not need to handle accumulated errors.

References

  1. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)

  2. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890 (2017)

  3. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848 (2017)

  4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. ArXiv e-prints (2014)

  5. Zhao, H., et al.: PSANet: point-wise spatial attention network for scene parsing. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 270–286. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_17

  6. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. ArXiv e-prints (2017)

  7. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  8. Armeni, I., Sax, A., Zamir, A.R., Savarese, S.: Joint 2D–3D-semantic data for indoor scene understanding. ArXiv e-prints (2017)

  9. Huang, Z., Wang, X., Wang, J., Liu, W., Wang, J.: Weakly-supervised semantic segmentation network with deep seeded region growing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7014–7023 (2018)

  10. Wei, Y., Xiao, H., Shi, H., Jie, Z., Feng, J., Huang, T.S.: Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7268–7277 (2018)

  11. Ahn, J., Kwak, S.: Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4981–4990 (2018)

  12. Fan, J., Zhang, Z., Tan, T.: CIAN: cross-image affinity net for weakly supervised semantic segmentation. ArXiv e-prints (2018)

  13. Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What’s the point: semantic segmentation with point supervision. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 549–565. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_34

  14. Vernaza, P., Chandraker, M.: Learning random-walk label propagation for weakly-supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7158–7166 (2017)

  15. Lin, D., Dai, J., Jia, J., He, K., Sun, J.: ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3159–3167 (2016)

  16. Tang, M., Perazzi, F., Djelouah, A., Ayed, I.B., Schroers, C., Boykov, Y.: On regularized losses for weakly-supervised CNN segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 524–540. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_31

  17. Dai, J., He, K., Sun, J.: BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1635–1643 (2015)

  18. Papandreou, G., Chen, L.C., Murphy, K.P., Yuille, A.L.: Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1742–1750 (2015)

  19. Khoreva, A., Benenson, R., Hosang, J., Hein, M., Schiele, B.: Simple does it: weakly supervised instance and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 876–885 (2017)

  20. Li, Q., Arnab, A., Torr, P.H.: Weakly- and semi-supervised panoptic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 102–118 (2018)

  21. Song, C., Huang, Y., Ouyang, W., Wang, L.: Box-driven class-wise region masking and filling rate guided loss for weakly supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3136–3145 (2019)

  22. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Proceedings of Advances in Neural Information Processing Systems, pp. 109–117 (2011)

  23. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. (TOG) 23, 309–314 (2004)

  24. Pont-Tuset, J., Arbelaez, P., Barron, J.T., Marques, F., Malik, J.: Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 128–140 (2016)

  25. Xiao, J., Owens, A., Torralba, A.: SUN3D: a database of big spaces reconstructed using SfM and object labels. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1625–1632 (2013)

  26. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839 (2017)

  27. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. (IJRR) (2013)

  28. Huang, X., et al.: The ApolloScape dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 954–960 (2018)

  29. Sun, P., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454 (2020)

  30. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28

  31. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495 (2017)

  32. Liu, W., Rabinovich, A., Berg, A.C.: ParseNet: looking wider to see better. ArXiv e-prints (2015)

  33. Yuan, Y., Wang, J.: OCNet: object context network for scene parsing. ArXiv e-prints (2018)

  34. Zhang, H., Zhang, H., Wang, C., Xie, J.: Co-occurrent features in semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 548–557 (2019)

  35. Zhou, Y., Sun, X., Zha, Z.J., Zeng, W.: Context-reinforced semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4046–4055 (2019)

  36. He, J., Deng, Z., Zhou, L., Wang, Y., Qiao, Y.: Adaptive pyramid context network for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7519–7528 (2019)

  37. Fu, J., et al.: Dual attention network for scene segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154 (2019)

  38. Ren, X., Bo, L., Fox, D.: RGB-(D) scene labeling: features and algorithms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2759–2766 (2012)

  39. Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 564–571 (2013)

  40. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54

  41. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2650–2658 (2015)

  42. Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23

  43. Qi, X., Liao, R., Jia, J., Fidler, S., Urtasun, R.: 3D graph neural networks for RGBD semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5199–5208 (2017)

  44. Park, S.J., Hong, K.S., Lee, S.: RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4980–4989 (2017)

  45. Wang, W., Neumann, U.: Depth-aware CNN for RGB-D segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 144–161. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_9

  46. Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4421–4430 (2019)

  47. Vechersky, P., Cox, M., Borges, P., Lowe, T.: Colourising point clouds using independent cameras. IEEE Robot. Autom. Lett. 3, 3575–3582 (2018)

  48. Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. arXiv preprint arXiv:1912.08830 (2019)

  49. Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: International Conference on 3D Vision (3DV) (2017)

  50. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2003)

  51. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927 (2018)

  52. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979)

  53. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 833–851. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_49

  54. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)

  55. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)


Author information


Corresponding author

Correspondence to Weixuan Sun.


1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5602 KB)


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Sun, W., Zhang, J., Barnes, N. (2021). 3D Guided Weakly Supervised Semantic Segmentation. In: Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science, vol 12622. Springer, Cham. https://doi.org/10.1007/978-3-030-69525-5_35


  • DOI: https://doi.org/10.1007/978-3-030-69525-5_35


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69524-8

  • Online ISBN: 978-3-030-69525-5

  • eBook Packages: Computer Science, Computer Science (R0)
