
Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization

Authors:

Satoshi Tsutsui,

Mike Zheng Shou

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 2507 - 2515

https://doi.org/10.1145/3581783.3612493

Published: 27 October 2023

Abstract

Open-World Instance Segmentation (OWIS) is an emerging research topic that aims to segment class-agnostic object instances from images. The mainstream approaches use a two-stage segmentation framework, which first locates candidate object bounding boxes and then performs instance segmentation. In this work, we instead promote a single-stage transformer-based framework for OWIS. We argue that end-to-end training in the single-stage framework makes it more convenient to directly regularize the localization of class-agnostic object pixels. Building on the transformer-based instance segmentation framework, we propose a regularization model that predicts foreground pixels, and we use its relation to instance segmentation to construct a cross-task consistency loss. We show that this consistency loss can alleviate the problem of incomplete instance annotation, a common problem in existing OWIS datasets. We also show that the proposed loss lends itself to an effective solution for semi-supervised OWIS, which can be considered an extreme case in which all object annotations are absent for some images. Our extensive experiments demonstrate that the proposed method achieves impressive results in both fully-supervised and semi-supervised settings. Compared to SOTA methods, the proposed method significantly improves the AP_100 score by 4.75% in the UVO → UVO setting and by 4.05% in the COCO → UVO setting.
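
The abstract does not give the exact form of the cross-task consistency loss, so the following is only a minimal sketch of the general idea: the union of the predicted instance masks should agree with the predicted foreground map, and that agreement can be measured without any ground-truth labels. The per-pixel maximum as the union, the Dice-style overlap measure, and the function names union_of_instances and dice_consistency_loss are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def union_of_instances(instance_probs):
    """Merge per-instance mask probabilities (N, H, W) into one (H, W) map.

    Assumption: a pixel counts as foreground if any instance claims it,
    approximated here by the per-pixel maximum over instances.
    """
    if instance_probs.shape[0] == 0:
        # No predicted instances: nothing is claimed as foreground.
        return np.zeros(instance_probs.shape[1:], dtype=np.float64)
    return instance_probs.max(axis=0)


def dice_consistency_loss(instance_probs, foreground_prob, eps=1e-6):
    """Dice-style disagreement between the instance union and the foreground map.

    Hypothetical stand-in for a cross-task consistency term: returns a value
    near 0 when both predictions cover the same pixels and near 1 when they
    do not overlap. No annotations are used.
    """
    union = union_of_instances(instance_probs)
    intersection = (union * foreground_prob).sum()
    denom = union.sum() + foreground_prob.sum()
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


if __name__ == "__main__":
    # Two toy instances on a 4x4 image, each covering one 2x2 corner.
    inst = np.zeros((2, 4, 4))
    inst[0, :2, :2] = 1.0
    inst[1, 2:, 2:] = 1.0
    agreeing_fg = inst.max(axis=0)
    print(dice_consistency_loss(inst, agreeing_fg))        # consistent predictions
    print(dice_consistency_loss(inst, 1.0 - agreeing_fg))  # contradictory predictions
```

Because a term of this kind compares two predictions of the same network rather than a prediction and a label, it can in principle be evaluated on images with missing or no annotations, which is the property the abstract relies on for the semi-supervised setting.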

Supplemental Material

MP4 File

The video description for ACM MM23 paper "Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization"

93.54 MB

Cited By

Wilms C, Rolff T, Hillemann M, Johanson R, Frintrop S. (2024). SOS: Segment Object System for Open-World Instance Segmentation with Object Priors. Computer Vision – ECCV 2024, 165-182. Online publication date: 3-Nov-2024.
https://doi.org/10.1007/978-3-031-73383-3_10

Index Terms

Transformer-based Open-world Instance Segmentation with Cross-task Consistency Regularization
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation
      2. Computer vision tasks
        Scene understanding

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy

Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Key R & D projects of Shaanxi Province, China
National Natural Science Foundation of China
National Research Foundation, Singapore

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
