Abstract
We present GvSeg, a general video segmentation framework that addresses four different video segmentation tasks (i.e., instance, semantic, panoptic, and exemplar-guided) while maintaining an identical architectural design. There is currently a trend towards general video segmentation solutions that apply across multiple tasks, which streamlines research and simplifies deployment. However, current designs are highly homogenized, keeping every element uniform across tasks; this can overlook the inherent diversity among tasks and lead to suboptimal performance. To tackle this, GvSeg: i) provides a holistic disentanglement and modeling of segment targets, thoroughly examining them from the perspectives of appearance, position, and shape, and, on this basis, ii) reformulates the query initialization, matching, and sampling strategies in alignment with task-specific requirements. These architecture-agnostic innovations enable GvSeg to effectively address each task by accommodating the specific properties that characterize it. Extensive experiments on seven gold-standard benchmark datasets demonstrate that GvSeg surpasses all existing specialized/general solutions by a significant margin on four different video segmentation tasks.
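For context on the query matching that the abstract says GvSeg reformulates: query-based segmenters in the DETR/Mask2Former lineage assign each predicted query to a ground-truth segment via Hungarian matching over a combined classification and mask cost. The sketch below is the generic baseline formulation, not GvSeg's task-oriented variant; the function names, the brute-force assignment (exact only for small query counts), and the cost weights `w_cls`/`w_mask` are illustrative assumptions.

```python
from itertools import permutations

def hungarian_match(cost):
    """Exact minimum-cost one-to-one assignment of queries to targets.

    cost[q][t] is the cost of assigning query q to target t.
    Brute force over permutations: fine for small toy sizes; real
    implementations use the O(n^3) Hungarian algorithm.
    """
    num_targets = len(cost[0])
    best, best_perm = float("inf"), None
    for perm in permutations(range(len(cost)), num_targets):
        c = sum(cost[q][t] for t, q in enumerate(perm))
        if c < best:
            best, best_perm = c, perm
    return [(q, t) for t, q in enumerate(best_perm)]

def match_cost(cls_prob, pred_mask, tgt_label, tgt_mask, w_cls=1.0, w_mask=1.0):
    """Pairwise cost: (1 - probability of the target class) plus a soft-Dice mask cost.

    cls_prob:  per-class probabilities for one query
    pred_mask: soft mask values at sampled points
    tgt_mask:  binary ground-truth mask at the same points
    """
    inter = sum(p * g for p, g in zip(pred_mask, tgt_mask))
    dice = 1 - (2 * inter + 1) / (sum(pred_mask) + sum(tgt_mask) + 1)
    return w_cls * (1 - cls_prob[tgt_label]) + w_mask * dice
```

For example, two queries whose masks and class scores each align with one of two targets produce a cost matrix whose optimal assignment pairs query 0 with target 0 and query 1 with target 1. Task-oriented variants (as the abstract describes) adjust how such costs weigh appearance, position, and shape cues per task.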
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Chen, M., Li, L., Wang, W., Quan, R., Yang, Y. (2025). General and Task-Oriented Video Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15065. Springer, Cham. https://doi.org/10.1007/978-3-031-72667-5_5
DOI: https://doi.org/10.1007/978-3-031-72667-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72666-8
Online ISBN: 978-3-031-72667-5
eBook Packages: Computer Science (R0)