OWS-Seg: Online Weakly Supervised Video Instance Segmentation via Contrastive Learning

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14260)

Abstract

Video Instance Segmentation (VIS) aims to detect, segment, and track instances appearing in a video. To reduce annotation costs, some existing VIS methods adopt a weakly supervised scheme (WSVIS). However, these WSVIS methods usually run offline and therefore fail to handle ongoing and long videos under limited computational resources. It would be of considerable benefit if online models could match or surpass the performance of offline models. In this paper, we propose OWS-Seg, a simple, efficient, end-to-end online WSVIS network with box annotations. Concretely, OWS-Seg consists of two novel contrastive learning branches: the Instance Contrastive Learning (ICL) branch learns instance-level discriminative features to distinguish different instances in each frame, and the Mask Contrastive Learning (MCL) branch with Boxccam learns pixel-level discriminative features to differentiate foreground from background. Experimental results show that OWS-Seg achieves promising performance, e.g., 43.5% AP on YouTube-VIS 2019, 36.6% AP on YouTube-VIS 2021, and 21.9% AP on OVIS. Moreover, OWS-Seg achieves performance comparable to offline WSVIS methods and surpasses recent fully supervised methods, demonstrating its potential for a wide range of practical applications.
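
The abstract does not give the loss used by the ICL branch. As a rough illustration of what instance-level contrastive learning generally involves, the sketch below computes a generic InfoNCE-style loss for one instance embedding. This is an assumption for illustration only, not the authors' formulation: the function names `cosine` and `info_nce`, the temperature value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical stand-in, not the paper's loss).

    anchor:    embedding of one instance
    positive:  embedding that should match the anchor (e.g. same instance)
    negatives: embeddings of other instances
    """
    # The positive logit goes first, followed by one logit per negative.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Toy embeddings (made up): the positive lies close to the anchor.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_matched = info_nce(anchor, positive, negatives)
# Swapping in a dissimilar vector as the "positive" should raise the loss.
loss_mismatched = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

In this toy setup the loss is lower when the designated positive is actually similar to the anchor, which is the property a contrastive branch relies on to pull embeddings of the same instance together and push different instances apart.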

Acknowledgements

The authors gratefully acknowledge the financial support from the National Key R&D Program of China (No. 2021ZD0113805, No. 2020YFD0900204) and the Key Research and Development Plan Project of Guangdong Province (No. 2020B0202010009). We appreciate the comments of the seminar participants at the Center for Deep Learning of Computer Vision Research at China Agricultural University, which significantly improved the manuscript.

Author information

Corresponding author

Correspondence to Zhenbo Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ning, Y., Li, F., Dong, M., Li, Z. (2023). OWS-Seg: Online Weakly Supervised Video Instance Segmentation via Contrastive Learning. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_39

  • DOI: https://doi.org/10.1007/978-3-031-44195-0_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44194-3

  • Online ISBN: 978-3-031-44195-0

  • eBook Packages: Computer Science, Computer Science (R0)
