
CYBORGS: Contrastively Bootstrapping Object Representations by Grounding in Segmentation

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13691)


Abstract

Many recent approaches in contrastive learning have worked to close the gap between pretraining on iconic images, such as those in ImageNet, and pretraining on complex scenes, such as those in COCO. This gap exists largely because the commonly used random-crop augmentation yields semantically inconsistent content when applied to crowded scene images containing diverse objects. In this work, we propose a framework that tackles this problem via joint learning of representations and segmentation. We leverage segmentation masks to train a model with a mask-dependent contrastive loss, and use the partially trained model to bootstrap better masks. By iterating between these two components, we ground the contrastive updates in segmentation information and simultaneously improve segmentation throughout pretraining. Experiments show that our representations transfer robustly to downstream tasks in classification, detection, and segmentation. (Code and pretrained models are available at https://github.com/renwang435/CYBORGS.)

H. Zhao and Y. Gao—Equal advising.
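
The abstract describes an alternating scheme: train with a mask-dependent contrastive loss, then use the partially trained model to re-estimate ("bootstrap") the masks. The sketch below is a minimal, hypothetical PyTorch rendering of that loop, not the authors' implementation: the toy encoder, the per-segment average pooling, and the naive k-means mask estimator are all illustrative stand-ins, and the two views are kept spatially aligned so a single mask can index both feature maps (the paper would instead handle geometric augmentations of the views).

```python
# Hypothetical sketch of the alternating loop described in the abstract.
# The encoder, per-segment pooling, and k-means mask estimator are
# illustrative assumptions, not the paper's exact components.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a real backbone; no downsampling, so the feature map
# stays at mask resolution.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1),
)

def masked_contrastive_loss(f1, f2, masks, tau=0.1):
    """Average-pool dense features inside each segment, then contrast the
    same segment across the two views (positives on the diagonal)."""
    z1, z2 = [], []
    for b in range(f1.shape[0]):
        for seg in masks[b].unique():
            region = masks[b] == seg              # (H, W) boolean mask
            z1.append(f1[b, :, region].mean(dim=1))
            z2.append(f2[b, :, region].mean(dim=1))
    z1 = F.normalize(torch.stack(z1), dim=1)
    z2 = F.normalize(torch.stack(z2), dim=1)
    logits = z1 @ z2.t() / tau
    target = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, target)

@torch.no_grad()
def bootstrap_masks(images, k=8, iters=10):
    """Re-estimate masks by naive k-means over dense features: a simple
    stand-in for the paper's mask-bootstrapping component."""
    feats = encoder(images)                       # (B, C, H, W)
    B, C, H, W = feats.shape
    out = []
    for b in range(B):
        x = feats[b].permute(1, 2, 0).reshape(-1, C)
        centers = x[torch.randperm(len(x))[:k]].clone()
        for _ in range(iters):
            assign = torch.cdist(x, centers).argmin(dim=1)
            for j in range(k):
                pts = x[assign == j]
                if len(pts):
                    centers[j] = pts.mean(dim=0)
        out.append(assign.reshape(H, W))
    return torch.stack(out)

# Alternate between the two components (toy data; real pretraining would
# use COCO images and two augmented views per image).
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)
images = torch.randn(2, 3, 32, 32)
masks = bootstrap_masks(images)                   # initial masks
for step in range(4):
    v1 = images + 0.1 * torch.randn_like(images)  # photometric-style noise
    v2 = images + 0.1 * torch.randn_like(images)
    loss = masked_contrastive_loss(encoder(v1), encoder(v2), masks)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 2 == 1:                             # periodically refresh masks
        masks = bootstrap_masks(images)
```

The essential design choice is the interlock the abstract points to: better features produce better segment clusters, and better masks make the contrastive positives more semantically consistent.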



Acknowledgements

YG is supported by the Ministry of Science and Technology of the People’s Republic of China, the 2030 Innovation Megaprojects “Program on New Generation Artificial Intelligence” (Grant No. 2021AAA0150000). YG is also supported by a grant from the Guoqiang Institute, Tsinghua University. RW would like to thank Yu Sun and Yingdong Hu for valuable edits to the paper, without which this work would not be possible.

Author information

Correspondence to Yang Gao.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 401 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, R., Zhao, H., Gao, Y. (2022). CYBORGS: Contrastively Bootstrapping Object Representations by Grounding in Segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13691. Springer, Cham. https://doi.org/10.1007/978-3-031-19821-2_15


  • DOI: https://doi.org/10.1007/978-3-031-19821-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19820-5

  • Online ISBN: 978-3-031-19821-2

  • eBook Packages: Computer Science, Computer Science (R0)
