Abstract
Given a query patch from a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison. However, due to the extremely limited guidance available for the novel class and the unseen appearance differences between query and target instances, it is difficult to appropriately exploit their semantic similarity and to generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture the bi-directional correspondence between any pair of pixels from the query and the target image, which empowers us to sufficiently exploit their semantic characteristics for accurate similarity comparison. In addition, the proposed CAT enables feature dimensionality compression for inference speedup without performance loss. Extensive experiments on three object detection datasets, MS-COCO, PASCAL VOC, and FSOD, under the one-shot setting demonstrate the effectiveness and efficiency of our model; e.g., it surpasses CoAE, a major baseline in this task, by 1.0% in average precision (AP) on MS-COCO and runs nearly 2.5 times faster.
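To make the described mechanism concrete, the sketch below shows one bi-directional cross-attention block in PyTorch, in the spirit of the abstract: every token (pixel) of the target feature map attends to every token of the query patch, and vice versa. This is a minimal illustration only; the class name, the feature dimension of 256, and the use of nn.MultiheadAttention are our illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Minimal sketch of bi-directional cross-attention between a query
    patch and a target image (illustrative; not the paper's code)."""

    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        # Target tokens attend to query tokens, and query tokens attend
        # to target tokens, using standard multi-head attention [9].
        self.t_from_q = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.q_from_t = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_q = nn.LayerNorm(d_model)

    def forward(self, query_feat: torch.Tensor, target_feat: torch.Tensor):
        # query_feat:  (B, Nq, C) -- flattened query-patch feature map
        # target_feat: (B, Nt, C) -- flattened target-image feature map
        t_enriched, _ = self.t_from_q(target_feat, query_feat, query_feat)
        q_enriched, _ = self.q_from_t(query_feat, target_feat, target_feat)
        # Residual connection + layer norm, as in a standard transformer block.
        target_out = self.norm_t(target_feat + t_enriched)
        query_out = self.norm_q(query_feat + q_enriched)
        return query_out, target_out


if __name__ == "__main__":
    block = CrossAttentionBlock()
    # Flatten 2D feature maps (B, C, H, W) into token sequences (B, H*W, C).
    q = torch.randn(2, 256, 8, 8).flatten(2).transpose(1, 2)    # query patch: 64 tokens
    t = torch.randn(2, 256, 32, 32).flatten(2).transpose(1, 2)  # target image: 1024 tokens
    q_out, t_out = block(q, t)
    print(q_out.shape, t_out.shape)  # torch.Size([2, 64, 256]) torch.Size([2, 1024, 256])
```

In a full detector, the enriched target features would then feed a region proposal network and detection head in the style of Faster R-CNN [6]; the feature-dimensionality compression mentioned in the abstract could plausibly be realized by projecting the channel dimension down before attention, though that, too, is our assumption rather than the paper's stated design.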
References
Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014, pp.580–587. DOI: https://doi.org/10.1109/CVPR.2014.81.
Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. DOI: https://doi.org/10.1109/TPAMI.2016.2577031.
Hsieh T I, Lo Y C, Chen H T, Liu T L. One-shot object detection with co-attention and co-excitation. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 245.
Fan Q, Zhuo W, Tang C K, Tai Y W. Few-shot object detection with attention-RPN and multi-relation detector. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp.4012–4021. DOI: https://doi.org/10.1109/CVPR42600.2020.00407.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.6000–6010. DOI: https://doi.org/10.5555/3295222.3295349.
Chen H, Wang Y L, Wang G Y, Qiao Y. LSTD: A low-shot transfer detector for object detection. In Proc. the 32nd AAAI Conference on Artificial Intelligence, Feb. 2018, pp.2836–2843. DOI: https://doi.org/10.1609/aaai.v32i1.11716.
Kang B Y, Liu Z, Wang X, Yu F, Feng J S, Darrell T. Few-shot object detection via feature reweighting. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27–Nov. 2, 2019, pp.8419–8428. DOI: https://doi.org/10.1109/ICCV.2019.00851.
Karlinsky L, Shtok J, Harary S, Schwartz E, Aides A, Feris R, Giryes R, Bronstein A M. RepMet: Representative-based metric learning for classification and few-shot object detection. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.5192–5201. DOI: https://doi.org/10.1109/CVPR.2019.00534.
Osokin A, Sumin D, Lomakin V. OS2D: One-stage one-shot object detection by matching anchor features. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.635–652. DOI: https://doi.org/10.1007/978-3-030-58555-6_38.
Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: A survey. ACM Computing Surveys, 2023, 55(6): Article No. 109. DOI: https://doi.org/10.1145/3530811.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. the 9th International Conference on Learning Representations, May 2021.
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In Proc. the 38th International Conference on Machine Learning, Jul. 2021, pp.10347–10357.
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.213–229. DOI: https://doi.org/10.1007/978-3-030-58452-8_13.
Zhu X Z, Su W J, Lu L W, Li B, Wang X G, Dai J F. Deformable DETR: Deformable transformers for end-to-end object detection. In Proc. the 9th International Conference on Learning Representations, May 2021.
Ye L W, Rochan M, Liu Z, Wang Y. Cross-modal self-attention network for referring image segmentation. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.10494–10503. DOI: https://doi.org/10.1109/CVPR.2019.01075.
Tan H, Bansal M. LXMERT: Learning cross-modality encoder representations from transformers. In Proc. the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Nov. 2019, pp.5100–5111. DOI: https://doi.org/10.18653/v1/D19-1514.
Su W J, Zhu X Z, Cao Y, Li B, Lu L W, Wei F R, Dai J F. VL-BERT: Pre-training of generic visual-linguistic representations. In Proc. the 8th International Conference on Learning Representations, Apr. 2020.
Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R, Hu S M. PCT: Point cloud transformer. Computational Visual Media, 2021, 7(2): 187–199. DOI: https://doi.org/10.1007/s41095-021-0229-5.
Yuan L, Chen Y P, Wang T, Yu W H, Shi Y J, Jiang Z H, Tay F E H, Feng J S, Yan S C. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proc. the 2021 IEEE/CVF International Conference on Computer Vision, Oct. 2021, pp.538–547. DOI: https://doi.org/10.1109/ICCV48922.2021.00060.
He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp.770–778. DOI: https://doi.org/10.1109/CVPR.2016.90.
Zhang Z M, Warrell J, Torr P H S. Proposal generation for object detection using cascaded ranking SVMs. In Proc. the 2011 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2011, pp.1497–1504. DOI: https://doi.org/10.1109/CVPR.2011.5995411.
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: Common objects in context. In Proc. the 13th European Conference on Computer Vision, Sept. 2014, pp.740–755. DOI: https://doi.org/10.1007/978-3-319-10602-1_48.
Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338. DOI: https://doi.org/10.1007/s11263-009-0275-4.
Chen K, Wang J Q, Pang J M, Cao Y H, Xiong Y, Li X X, Sun S Y, Feng W S, Liu Z W, Xu J R, Zhang Z, Cheng D Z, Zhu C C, Cheng T H, Zhao Q J, Li B Y, Lu X, Zhu R, Wu Y, Dai J F, Wang J D, Shi J P, Ouyang W L, Loy C C, Lin D H. MMDetection: Open MMLab detection toolbox and benchmark. arXiv: 1906.07155, 2019. https://arxiv.org/abs/1906.07155, March 2024.
Michaelis C, Ustyuzhaninov I, Bethge M, Ecker A S. One-shot instance segmentation. arXiv: 1811.11507, 2018. https://arxiv.org/abs/1811.11507, March 2024.
Fu K, Zhang T F, Zhang Y, Sun X. OSCD: A one-shot conditional object detection framework. Neurocomputing, 2021, 425: 243–255. DOI: https://doi.org/10.1016/j.neucom.2020.04.092.
Cen M B, Jung C. Fully convolutional Siamese fusion networks for object tracking. In Proc. the 25th IEEE International Conference on Image Processing, Oct. 2018, pp.3718–3722. DOI: https://doi.org/10.1109/ICIP.2018.8451102.
Li B, Yan J J, Wu W, Zhu Z, Hu X L. High performance visual tracking with Siamese region proposal network. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp.8971–8980. DOI: https://doi.org/10.1109/CVPR.2018.00935.
Wang X, Huang T E, Darrell T, Gonzalez J E, Yu F. Frustratingly simple few-shot object detection. In Proc. the 37th International Conference on Machine Learning, Jul. 2020, Article No. 920.
Wu X W, Sahoo D, Hoi S. Meta-RCNN: Meta learning for few-shot object detection. In Proc. the 28th ACM International Conference on Multimedia, Oct. 2020, pp.1679–1687. DOI: https://doi.org/10.1145/3394171.3413832.
Xiao Y, Marlet R. Few-shot object detection and viewpoint estimation for objects in the wild. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.192–210. DOI: https://doi.org/10.1007/978-3-030-58520-4_12.
Sun B, Li B H, Cai S C, Yuan Y, Zhang C. FSCE: Few-shot object detection via contrastive proposal encoding. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2021, pp.7348–7358. DOI: https://doi.org/10.1109/CVPR46437.2021.00727.
Wu J X, Liu S T, Huang D, Wang Y H. Multi-scale positive sample refinement for few-shot object detection. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.456–472. DOI: https://doi.org/10.1007/978-3-030-58517-4_27.
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B, Belongie S. Feature pyramid networks for object detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2017, pp.936–944. DOI: https://doi.org/10.1109/CVPR.2017.106.
Ethics declarations
Conflict of Interest: The authors declare that they have no conflict of interest.
Additional information
This work was supported by the National Science and Technology Major Project under Grant No. 2020AAA0106900, the National Natural Science Foundation of China under Grant Nos. U19B2307 and 61876152, the Shaanxi Provincial Key Research and Development Program of China under Grant No. 2021KWZ-03, and the Natural Science Basic Research Program of Shaanxi Province of China under Grant No. 2021JCW-03.
Wei-Dong Lin received his B.S. degree in hydroacoustic engineering from Northwestern Polytechnical University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and object detection.
Yu-Yan Deng received his B.S. degree in computer science and technology from Xidian University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and object detection.
Yang Gao received his B.S. degree in computer science and technology from Northwestern Polytechnical University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and automated machine learning.
Ning Wang received his B.S. degree in computer science and technology from Northwestern Polytechnical University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and neural architecture search.
Ling-Qiao Liu received his B.S. and M.S. degrees in communication engineering from the University of Electronic Science and Technology of China, Chengdu, in 2006 and 2009, respectively, and his Ph.D. degree in computer science from the Australian National University, Canberra, in 2014. In 2016, he was awarded the Discovery Early Career Researcher Award by the Australian Research Council and a University Research Fellowship by the University of Adelaide, Adelaide. He is now a senior lecturer at the University of Adelaide and the Australian Institute for Machine Learning, Adelaide. His current research interests include low-supervision learning and various topics in computer vision and natural language processing.
Lei Zhang received his Ph.D. degree in computer science and technology from Northwestern Polytechnical University, Xi’an, in 2018. He was a research staff member in the School of Computer Science, the University of Adelaide, Adelaide, between 2017 and 2019, and a research scientist at the Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates, between 2019 and 2020. He is currently a professor with the School of Computer Science, Northwestern Polytechnical University, Xi’an. His research interests include image processing, machine learning, and video analysis.
Peng Wang received his B.S. degree in electrical engineering and automation from Beihang University, Beijing, in 2004, and his Ph.D. degree in control science and engineering from Beihang University, Beijing, in 2011. He is now a professor at the School of Computer Science, Northwestern Polytechnical University, Xi’an. He was with the School of Computer Science, the University of Adelaide, Adelaide, for about four years. His research interests include computer vision, machine learning, and artificial intelligence.
About this article
Cite this article
Lin, WD., Deng, YY., Gao, Y. et al. CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection. J. Comput. Sci. Technol. 39, 460–471 (2024). https://doi.org/10.1007/s11390-024-1743-6