VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement

Kim, Hanjung; Kang, Jaehyun; Heo, Miran; Hwang, Sukjun; Oh, Seoung Wug; Kim, Seon Joo

doi:10.1007/978-3-031-72667-5_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15065)

Included in the following conference series:

European Conference on Computer Vision

Abstract

In recent years, online Video Instance Segmentation (VIS) methods have advanced remarkably, driven by powerful query-based detectors. Using the detector's frame-level output queries, these methods achieve high accuracy on challenging benchmarks. However, we observe that they rely heavily on location information, which often causes incorrect associations between objects. This paper shows that appearance information is a key axis of object matching in trackers, and that it becomes especially informative when positional cues are insufficient to distinguish object identities. We therefore propose a simple yet powerful extension to object decoders that explicitly extracts embeddings from backbone features and drives queries to capture the appearance of objects, which greatly improves instance association accuracy. Furthermore, because existing benchmarks do not fully evaluate appearance awareness, we construct a synthetic dataset to rigorously validate our method. By resolving the over-reliance on location information, we achieve state-of-the-art results on YouTube-VIS 2019/2021 and Occluded VIS (OVIS). Code is available at https://github.com/KimHanjung/VISAGE.
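
The idea of adding an appearance cue to query-based association can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It is a minimal example under stated assumptions: each instance carries a query embedding (assumed here to be dominated by location) and a separate appearance embedding, the two cosine similarities are mixed with a hypothetical weight `alpha`, and the best one-to-one matching is found by brute force. The function names, the weighting scheme, and the toy vectors are all illustrative choices, not details from the paper.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def associate(prev, curr, alpha=0.5):
    """Match previous-frame instances to current-frame instances.

    prev, curr: equal-length lists of (query_emb, appearance_emb) pairs.
    alpha: hypothetical weight on the appearance term (0 = queries only).
    Returns a tuple m where m[i] is the index in curr matched to prev[i].
    """
    n = len(prev)
    score = [
        [
            (1 - alpha) * cosine(prev[i][0], curr[j][0])
            + alpha * cosine(prev[i][1], curr[j][1])
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Brute-force search over assignments; fine for a handful of objects.
    # A Hungarian-style solver would replace this for larger n.
    return max(
        permutations(range(n)),
        key=lambda p: sum(score[i][p[i]] for i in range(n)),
    )


# Toy scene (made-up numbers): two objects swap positions between frames.
# Query embeddings follow location, appearance embeddings follow identity.
prev = [([1.0, 0.0], [1.0, 0.0]),   # object A, on the left
        ([0.0, 1.0], [0.0, 1.0])]   # object B, on the right
curr = [([1.0, 0.1], [0.0, 1.0]),   # left slot now looks like B
        ([0.1, 1.0], [1.0, 0.0])]   # right slot now looks like A

location_only = associate(prev, curr, alpha=0.0)
with_appearance = associate(prev, curr, alpha=0.5)
```

In this toy scene, matching on query similarity alone keeps each identity pinned to its old position, while mixing in the appearance term follows the objects through the swap. That is the failure mode the abstract attributes to over-reliance on location.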

Acknowledgements

This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (Artificial Intelligence Graduate School Program, Yonsei University, under Grant 2020-0-01361) and Artificial Intelligence Innovation Hub under Grant RS-2021-II212068.

Author information

Authors and Affiliations

Yonsei University, Seoul, South Korea
Hanjung Kim, Jaehyun Kang, Miran Heo & Seon Joo Kim
Carnegie Mellon University, Pittsburgh, USA
Sukjun Hwang
Adobe Research, San Francisco, USA
Seoung Wug Oh

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 69814 KB)

About this paper

Cite this paper

Kim, H., Kang, J., Heo, M., Hwang, S., Oh, S.W., Kim, S.J. (2025). VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_6

DOI: https://doi.org/10.1007/978-3-031-72667-5_6
Published: 29 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5
eBook Packages: Computer Science; Computer Science (R0)
