
Video object segmentation through semantic visual words matching

Multimedia Tools and Applications

Abstract

Video object segmentation (VOS) is widely used in computer vision. However, existing VOS algorithms struggle with object deformation, occlusion, and fast motion. We therefore propose an effective VOS algorithm based on semantic visual words matching. Specifically, given a support frame and its corresponding mask, the frame is first passed through an encoder with an embedding layer, and a clustering algorithm is then applied to generate a group of semantic visual words according to the mask. For a query frame to be segmented, a matching operation is performed against the words generated from the support frame, so that each pixel in the query frame can be classified into an object category according to the obtained similarity. In addition, a self-attention mechanism is applied to enhance the embedding features and capture global dependencies before word matching. To further handle changes in object appearance and global mismatches, our method also employs online update and correction mechanisms. Experiments show that the proposed method achieves competitive results on the DAVIS 2016 and DAVIS 2017 datasets: the J&F-mean, i.e., the mean of region similarity (J) and contour accuracy (F), reaches 83.2% on DAVIS 2016 and 72.3% on DAVIS 2017.
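The abstract describes the words-matching step only at a high level. The following is a minimal Python sketch of the general idea: cluster the masked support-frame embeddings into labelled "visual words", then give each query pixel the label of its most similar word. It assumes k-means as the clustering algorithm (the abstract says only "a clustering algorithm"), cosine similarity as the matching score, and per-pixel embeddings that are already computed. The encoder, self-attention, and online update/correction steps are omitted. All function names and the `words_per_label` parameter are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import defaultdict

# Sketch under assumptions: k-means clustering and cosine matching are
# illustrative choices, not the paper's implementation.
# Each pixel is represented by one (hypothetical) embedding vector.
Vector = list[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, used for k-means assignment."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def kmeans(points: list[Vector], k: int, iters: int = 20, seed: int = 0) -> list[Vector]:
    """Plain Lloyd's k-means; returns min(k, len(points)) centroids."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters: list[list[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def build_visual_words(embeddings, mask, words_per_label=2):
    """Cluster support-frame embeddings separately for each mask label.

    Returns a list of (word_vector, label) pairs.
    """
    by_label = defaultdict(list)
    for vec, label in zip(embeddings, mask):
        by_label[label].append(vec)
    words = []
    for label in sorted(by_label):
        for centroid in kmeans(by_label[label], words_per_label):
            words.append((centroid, label))
    return words


def match_query(query_embeddings, words):
    """Label each query pixel with the label of its most similar word."""
    labels = []
    for vec in query_embeddings:
        _, best_label = max(words, key=lambda w: cosine(vec, w[0]))
        labels.append(best_label)
    return labels


# Toy usage: 2-D embeddings, label 0 = background, label 1 = object.
support = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
           [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]
mask = [0, 0, 0, 1, 1, 1]
words = build_visual_words(support, mask, words_per_label=2)
print(match_query([[0.8, 0.2], [0.2, 0.8]], words))  # [0, 1]
```

Taking the single best-matching word over all labels gives the same decision as first taking a per-label maximum similarity and then the best label; a softmax over those per-label scores would instead yield soft, per-pixel probabilities.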


Figs. 1–8 (images not reproduced)


Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.


Author information

Corresponding author

Correspondence to Chuanyan Hao.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information


This work was partially supported by the National Natural Science Foundation of China (Grant No. 61802197), in part by the Science and Technology Development Fund, Macau SAR (File Nos. SKL-IOTSC-2018-2020, 0018/2019/AKP, 0008/2019/AGJ, and FDCT/194/2017/A3), and in part by the University of Macau under Grants MYRG2018-00248-FST and MYRG2019-0137-FST.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hao, C., Chen, Y., Wu, W. et al. Video object segmentation through semantic visual words matching. Multimed Tools Appl 82, 19591–19605 (2023). https://doi.org/10.1007/s11042-023-14361-w


  • DOI: https://doi.org/10.1007/s11042-023-14361-w
