DOI: 10.1145/3581783.3612493
research-article

Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization

Published: 27 October 2023

Abstract

Open-World Instance Segmentation (OWIS) is an emerging research topic that aims to segment class-agnostic object instances from images. The mainstream approaches use a two-stage segmentation framework, which first locates candidate object bounding boxes and then performs instance segmentation. In this work, we instead promote a single-stage transformer-based framework for OWIS. We argue that the end-to-end training process of the single-stage framework makes it more convenient to directly regularize the localization of class-agnostic object pixels. Building on the transformer-based instance segmentation framework, we propose a regularization model that predicts foreground pixels, and we use its relation to instance segmentation to construct a cross-task consistency loss. We show that such a consistency loss can alleviate the problem of incomplete instance annotation, a common problem in existing OWIS datasets. We also show that the proposed loss lends itself to an effective solution for semi-supervised OWIS, which can be considered an extreme case in which all object annotations are absent for some images. Our extensive experiments demonstrate that the proposed method achieves impressive results in both fully-supervised and semi-supervised settings. Compared to SOTA methods, the proposed method improves the AP_100 score by 4.75% in the UVO → UVO setting and by 4.05% in the COCO → UVO setting.
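
To make the idea of a cross-task consistency loss concrete, the following is a minimal sketch and not the formulation used in the paper. The abstract does not specify how the relation between the foreground prediction and the instance masks is defined, so two assumptions are made here: the instance masks are merged into a single foreground map by a per-pixel maximum, and the disagreement between that map and the foreground prediction is measured with a soft Dice loss. All function names are hypothetical.

```python
from typing import List

Mask = List[List[float]]  # H x W grid of probabilities in [0, 1]


def union_of_instances(instance_masks: List[Mask]) -> Mask:
    """Merge per-instance masks into one foreground map.

    Assumption: a pixel counts as foreground if any instance covers it,
    approximated by the per-pixel maximum over all instance masks.
    """
    if not instance_masks:
        raise ValueError("need at least one instance mask")
    height, width = len(instance_masks[0]), len(instance_masks[0][0])
    return [
        [max(m[y][x] for m in instance_masks) for x in range(width)]
        for y in range(height)
    ]


def dice_loss(pred: Mask, target: Mask, eps: float = 1e-6) -> float:
    """Soft Dice loss; it is 0 when two binary maps coincide."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(p for row in pred for p in row) + sum(t for row in target for t in row)
    return 1.0 - (2.0 * inter + eps) / (total + eps)


def cross_task_consistency_loss(instance_masks: List[Mask], foreground: Mask) -> float:
    """Penalize disagreement between the two predictions (assumed form)."""
    return dice_loss(union_of_instances(instance_masks), foreground)


# Two predicted instances on a 2 x 3 image.
inst_a = [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
inst_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Foreground map that agrees with the union of the instances.
fg_consistent = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Foreground map with one extra object pixel that no instance explains.
fg_extra = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

loss_consistent = cross_task_consistency_loss([inst_a, inst_b], fg_consistent)
loss_extra = cross_task_consistency_loss([inst_a, inst_b], fg_extra)
```

Under these assumptions the loss compares two predictions with each other rather than with ground truth, so it can be evaluated on images that have partial or no instance annotations. In the toy example, `loss_extra` is larger than `loss_consistent` because one foreground pixel is not covered by any predicted instance.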



Cited By

  • (2024) SOS: Segment Object System for Open-World Instance Segmentation with Object Priors. Computer Vision – ECCV 2024, 165-182. DOI: 10.1007/978-3-031-73383-3_10. Online publication date: 3-Nov-2024


      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. cross-task consistency
      2. instance segmentation
      3. open world

      Qualifiers

      • Research-article

      Conference

      MM '23
      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

      Article Metrics

      • Downloads (Last 12 months): 50
      • Downloads (Last 6 weeks): 8
      Reflects downloads up to 05 Mar 2025
