research-article

Dynamic Transformer for Few-shot Instance Segmentation

Authors:
Haochen Wang

University of Amsterdam, Amsterdam, Netherlands

University of Amsterdam, Amsterdam, Netherlands
View Profile

,
Jie Liu

University of Amsterdam, Amsterdam, Netherlands

University of Amsterdam, Amsterdam, Netherlands
View Profile

,
Yongtuo Liu

University of Amsterdam, Amsterdam, Netherlands

University of Amsterdam, Amsterdam, Netherlands
View Profile

,
Subhransu Maji

University of Massachusetts, Amherst, Netherlands

University of Massachusetts, Amherst, Netherlands
View Profile

,
Jan-Jakob Sonke

The Netherlands Cancer Institute, Amsterdam, Netherlands

The Netherlands Cancer Institute, Amsterdam, Netherlands
View Profile

,
Efstratios Gavves

University of Amsterdam, Amsterdam, Netherlands

University of Amsterdam, Amsterdam, Netherlands
View Profile

MM '22: Proceedings of the 30th ACM International Conference on MultimediaOctober 2022Pages 2969–2977https://doi.org/10.1145/3503161.3548227

Published:10 October 2022Publication History

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Pages 2969–2977

ABSTRACT

Few-shot instance segmentation aims to train an instance segmentation model that can fast adapt to novel classes with only a few reference images. Existing methods are usually derived from standard detection models and tackle few-shot instance segmentation indirectly by conducting classification, box regression, and mask prediction on a large set of redundant proposals followed by indispensable post-processing, e.g., Non-Maximum Suppression. Such complicated hand-crafted procedures and hyperparameters lead to degraded optimization and insufficient generalization ability. In this work, we propose an end-to-end Dynamic Transformer Network, DTN for short, to directly segment all target object instances from arbitrary categories given by reference images, relieving the requirements of dense proposal generation and post-processing. Specifically, a small set of Dynamic Queries, conditioned on reference images, are exclusively assigned to target object instances and generate all the instance segmentation masks of reference categories simultaneously. Moreover, a Semantic-induced Transformer Decoder is introduced to constrain the cross-attention between dynamic queries and target images within the pixels of the reference category, which suppresses the noisy interaction with the background and irrelevant categories. Extensive experiments are conducted on the COCO-20 dataset. The experiment results demonstrate that our proposed Dynamic Transformer Network significantly outperforms the state-of-the-arts.

Supplemental Material

Available for Download

mp4

MM22-fp1967.mp4 (167.3 MB)

References

Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. 2019. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF international conference on computer vision. 9157--9166.Google ScholarCross Ref
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European conference on computer vision. Springer, 213--229.Google ScholarDigital Library
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2017), 834--848.Google Scholar
Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. 2021. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8126--8135.Google ScholarCross Ref
Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. 2021. Masked-attention mask transformer for universal image segmentation. arXiv preprint arXiv:2112.01527 (2021).Google Scholar
Zhibo Fan, Jin-Gang Yu, Zhihao Liang, Jiarong Ou, Changxin Gao, Gui-Song Xia, and Yuanqing Li. 2020. Fgn: Fully guided network for few-shot instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9172--9181.Google ScholarCross Ref
Dan Andrei Ganea, Bas Boom, and Ronald Poppe. 2021. Incremental few-shot instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1185--1194.Google ScholarCross Ref
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961--2969.Google ScholarCross Ref
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770--778.Google ScholarCross Ref
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2980--2988.Google ScholarCross Ref
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740--755.Google ScholarCross Ref
Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. 2020. Part-aware prototype network for few-shot semantic segmentation. In European Conference on Computer Vision. Springer, 142--158.Google ScholarDigital Library
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012--10022.Google ScholarCross Ref
Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in adam. (2018).Google Scholar
Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. 2021. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702 (2021).Google Scholar
Claudio Michaelis, Ivan Ustyuzhaninov, Matthias Bethge, and Alexander S Ecker. 2018. One-shot instance segmentation. arXiv preprint arXiv:1811.11507 (2018).Google Scholar
Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV). IEEE, 565--571.Google Scholar
Khoi Nguyen and Sinisa Todorovic. 2021. Fapis: A few-shot anchor-free partbased instance segmenter. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11099--11108.Google ScholarCross Ref
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779--788.Google ScholarCross Ref
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015).Google Scholar
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. 2019. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 658--666.Google ScholarCross Ref
Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. 2020. Prior guided feature enrichment network for few-shot segmentation. IEEE transactions on pattern analysis and machine intelligence (2020).Google ScholarDigital Library
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).Google Scholar
Haochen Wang, Xudong Zhang, Yutao Hu, Yandan Yang, Xianbin Cao, and Xiantong Zhen. 2020. Few-shot semantic segmentation with democratic attention networks. In European Conference on Computer Vision. Springer, 730--746.Google ScholarDigital Library
Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. 2020. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43, 10 (2020), 3349--3364.Google Scholar
WenhaiWang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 568--578.Google Scholar
Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. 2020. Solo: Segmenting objects by locations. In European Conference on Computer Vision. Springer, 649--665.Google ScholarDigital Library
JunfengWu, Yi Jiang,Wenqing Zhang, Xiang Bai, and Song Bai. 2021. Seqformer: a frustratingly simple model for video instance segmentation. arXiv preprint arXiv:2112.08275 (2021).Google Scholar
Xiaopeng Yan, Ziliang Chen, Anni Xu, Xiaoxi Wang, Xiaodan Liang, and Liang Lin. 2019. Meta r-cnn: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9577-- 9586.Google ScholarCross Ref
Lihe Yang, Wei Zhuo, Lei Qi, Yinghuan Shi, and Yang Gao. 2021. Mining latent classes for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8721--8730.Google ScholarCross Ref
Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. 2019. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9587--9595.Google ScholarCross Ref
Gengwei Zhang, Guoliang Kang, Yi Yang, and Yunchao Wei. 2021. Few-shot segmentation via cycle-consistent transformer. Advances in Neural Information Processing Systems 34 (2021).Google Scholar
Chenchen Zhu, Yihui He, and Marios Savvides. 2019. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 840--849.Google ScholarCross Ref
Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020).Google Scholar

Index Terms

Dynamic Transformer for Few-shot Instance Segmentation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation

Recommendations

Transformer Based Multiple Instance Learning for Weakly Supervised Histopathology Image Segmentation
Medical Image Computing and Computer Assisted Intervention – MICCAI 2022
Abstract
Hispathological image segmentation algorithms play a critical role in computer aided diagnosis technology. The development of weakly supervised segmentation algorithm alleviates the problem of medical image annotation that it is time-consuming and ...
Read More
Hybrid supervised instance segmentation by learning label noise suppression
Abstract
To reach top accuracy, current fully supervised instance segmentation methods severely rely on large-scale pixel-wise labeled datasets. They are usually expensive and time-consuming to obtain. Though weakly or semi-supervised methods ...
Read More
One-shot Joint Extraction, Registration and Segmentation of Neuroimaging Data
KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Brain extraction, registration and segmentation are indispensable preprocessing steps in neuroimaging studies. The aim is to extract the brain from raw imaging scans (i.e., extraction step), align it with a target brain image (i.e., registration step) ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
General Chairs:
João Magalhães
NOVA University of Lisbon, Portugal
,
Alberto del Bimbo
University of Florence, Italy
,
Shin'ichi Satoh
National Institute of Informatics, Japan
,
Nicu Sebe
University of Trento, Italy
,
Program Chairs:
Xavier Alameda-Pineda
Inria, Grenoble, France
,
Qin Jin
Renmin University of China, China
,
Vincent Oria
New Jersey Institute of Technology, USA
,
Laura Toni
University College London, UK
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 October 2022
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
dynamic queries
few-shot instance segmentation
semantic-induced transformer decoder
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate995of4,171submissions,24%
Upcoming Conference
MM '24

Sponsor:

sigmm

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne , VIC , Australia
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 359
  Total Downloads
- Downloads (Last 12 months)167
- Downloads (Last 6 weeks)13
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Dynamic Transformer for Few-shot Instance Segmentation

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

ABSTRACT

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Transformer Based Multiple Instance Learning for Weakly Supervised Histopathology Image Segmentation

Hybrid supervised instance segmentation by learning label noise suppression

One-shot Joint Extraction, Registration and Segmentation of Neuroimaging Data