Abstract
Given a query patch from a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison. However, due to the extremely limited guidance available for the novel class and the unseen appearance differences between query and target instances, it is difficult to appropriately exploit their semantic similarity and to generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture the bi-directional correspondence between any pair of pixels from the query and the target image, which empowers us to sufficiently exploit their semantic characteristics for accurate similarity comparison. In addition, the proposed CAT enables feature dimensionality compression for inference speedup without performance loss. Extensive experiments on three object detection datasets, MS-COCO, PASCAL VOC, and FSOD, under the one-shot setting demonstrate the effectiveness and efficiency of our model; e.g., it surpasses CoAE, a major baseline in this task, by 1.0% in average precision (AP) on MS-COCO and runs nearly 2.5 times faster.
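To make the described mechanism concrete, the sketch below shows one bi-directional cross-attention block in PyTorch, in the spirit of the abstract: every token (pixel) of the target feature map attends to every token of the query patch, and vice versa. This is a minimal illustration only; the class name, the feature dimension of 256, and the use of nn.MultiheadAttention are our illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Minimal sketch of bi-directional cross-attention between a query
    patch and a target image (illustrative; not the paper's code)."""

    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        # Target tokens attend to query tokens, and query tokens attend
        # to target tokens, using standard multi-head attention [9].
        self.t_from_q = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.q_from_t = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_q = nn.LayerNorm(d_model)

    def forward(self, query_feat: torch.Tensor, target_feat: torch.Tensor):
        # query_feat:  (B, Nq, C) -- flattened query-patch feature map
        # target_feat: (B, Nt, C) -- flattened target-image feature map
        t_enriched, _ = self.t_from_q(target_feat, query_feat, query_feat)
        q_enriched, _ = self.q_from_t(query_feat, target_feat, target_feat)
        # Residual connection + layer norm, as in a standard transformer block.
        target_out = self.norm_t(target_feat + t_enriched)
        query_out = self.norm_q(query_feat + q_enriched)
        return query_out, target_out


if __name__ == "__main__":
    block = CrossAttentionBlock()
    # Flatten 2D feature maps (B, C, H, W) into token sequences (B, H*W, C).
    q = torch.randn(2, 256, 8, 8).flatten(2).transpose(1, 2)    # query patch: 64 tokens
    t = torch.randn(2, 256, 32, 32).flatten(2).transpose(1, 2)  # target image: 1024 tokens
    q_out, t_out = block(q, t)
    print(q_out.shape, t_out.shape)  # torch.Size([2, 64, 256]) torch.Size([2, 1024, 256])
```

In a full detector, the enriched target features would then feed a region proposal network and detection head in the style of Faster R-CNN [6]; the feature-dimensionality compression mentioned in the abstract could plausibly be realized by projecting the channel dimension down before attention, though that, too, is our assumption rather than the paper's stated design.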
References
Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014, pp.580–587. DOI: https://doi.org/10.1109/CVPR.2014.81.
Ren S Q, He K M, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. DOI: https://doi.org/10.1109/TPAMI.2016.2577031.
Hsieh T I, Lo Y C, Chen H T, Liu T L. One-shot object detection with co-attention and co-excitation. In Proc. the 33rd International Conference on Neural Information Processing Systems, Dec. 2019, Article No. 245.
Fan Q, Zhuo W, Tang C K, Tai Y W. Few-shot object detection with attention-RPN and multi-relation detector. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2020, pp.4012–4021. DOI: https://doi.org/10.1109/CVPR42600.2020.00407.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.6000–6010. DOI: https://doi.org/10.5555/3295222.3295349.
Chen H, Wang Y L, Wang G Y, Qiao Y. LSTD: A low-shot transfer detector for object detection. In Proc. the 32nd AAAI Conference on Artificial Intelligence, Feb. 2018, pp.2836–2843. DOI: https://doi.org/10.1609/aaai.v32i1.11716.
Kang B Y, Liu Z, Wang X, Yu F, Feng J S, Darrell T. Few-shot object detection via feature reweighting. In Proc. the 2019 IEEE/CVF International Conference on Computer Vision, Oct. 27–Nov. 2, 2019, pp.8419–8428. DOI: https://doi.org/10.1109/ICCV.2019.00851.
Karlinsky L, Shtok J, Harary S, Schwartz E, Aides A, Feris R, Giryes R, Bronstein A M. RepMet: Representative-based metric learning for classification and few-shot object detection. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.5192–5201. DOI: https://doi.org/10.1109/CVPR.2019.00534.
Osokin A, Sumin D, Lomakin V. OS2D: One-stage one-shot object detection by matching anchor features. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.635–652. DOI: https://doi.org/10.1007/978-3-030-58555-6_38.
Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: A survey. ACM Computing Surveys, 2023, 55(6): Article No. 109. DOI: https://doi.org/10.1145/3530811.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. the 9th International Conference on Learning Representations, May 2021.
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In Proc. the 38th International Conference on Machine Learning, Jul. 2021, pp.10347–10357.
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.213–229. DOI: https://doi.org/10.1007/978-3-030-58452-8_13.
Zhu X Z, Su W J, Lu L W, Li B, Wang X G, Dai J F. Deformable DETR: Deformable transformers for end-to-end object detection. In Proc. the 9th International Conference on Learning Representations, May 2021.
Ye L W, Rochan M, Liu Z, Wang Y. Cross-modal self-attention network for referring image segmentation. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2019, pp.10494–10503. DOI: https://doi.org/10.1109/CVPR.2019.01075.
Tan H, Bansal M. LXMERT: Learning cross-modality encoder representations from transformers. In Proc. the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Nov. 2019, pp.5100–5111. DOI: https://doi.org/10.18653/v1/D19-1514.
Su W J, Zhu X Z, Cao Y, Li B, Lu L W, Wei F R, Dai J F. VL-BERT: Pre-training of generic visual-linguistic representations. In Proc. the 8th International Conference on Learning Representations, Apr. 2020.
Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R, Hu S M. PCT: Point cloud transformer. Computational Visual Media, 2021, 7(2): 187–199. DOI: https://doi.org/10.1007/s41095-021-0229-5.
Yuan L, Chen Y P, Wang T, Yu W H, Shi Y J, Jiang Z H, Tay F E H, Feng J S, Yan S C. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proc. the 2021 IEEE/CVF International Conference on Computer Vision, Oct. 2021, pp.538–547. DOI: https://doi.org/10.1109/ICCV48922.2021.00060.
He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp.770–778. DOI: https://doi.org/10.1109/CVPR.2016.90.
Zhang Z M, Warrell J, Torr P H S. Proposal generation for object detection using cascaded ranking SVMs. In Proc. the 2011 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2011, pp.1497–1504. DOI: https://doi.org/10.1109/CVPR.2011.5995411.
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: Common objects in context. In Proc. the 13th European Conference on Computer Vision, Sept. 2014, pp.740–755. DOI: https://doi.org/10.1007/978-3-319-10602-1_48.
Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338. DOI: https://doi.org/10.1007/s11263-009-0275-4.
Chen K, Wang J Q, Pang J M, Cao Y H, Xiong Y, Li X X, Sun S Y, Feng W S, Liu Z W, Xu J R, Zhang Z, Cheng D Z, Zhu C C, Cheng T H, Zhao Q J, Li B Y, Lu X, Zhu R, Wu Y, Dai J F, Wang J D, Shi J P, Ouyang W L, Loy C C, Lin D H. MMDetection: Open MMLab detection toolbox and benchmark. arXiv: 1906.07155, 2019. https://arxiv.org/abs/1906.07155, March 2024.
Michaelis C, Ustyuzhaninov I, Bethge M, Ecker A S. One-shot instance segmentation. arXiv: 1811.11507, 2018. https://arxiv.org/abs/1811.11507, March 2024.
Fu K, Zhang T F, Zhang Y, Sun X. OSCD: A one-shot conditional object detection framework. Neurocomputing, 2021, 425: 243–255. DOI: https://doi.org/10.1016/j.neucom.2020.04.092.
Cen M B, Jung C. Fully convolutional Siamese fusion networks for object tracking. In Proc. the 25th IEEE International Conference on Image Processing, Oct. 2018, pp.3718–3722. DOI: https://doi.org/10.1109/ICIP.2018.8451102.
Li B, Yan J J, Wu W, Zhu Z, Hu X L. High performance visual tracking with Siamese region proposal network. In Proc. the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018, pp.8971–8980. DOI: https://doi.org/10.1109/CVPR.2018.00935.
Wang X, Huang T E, Darrell T, Gonzalez J E, Yu F. Frustratingly simple few-shot object detection. In Proc. the 37th International Conference on Machine Learning, Jul. 2020, Article No. 920.
Wu X W, Sahoo D, Hoi S. Meta-RCNN: Meta learning for few-shot object detection. In Proc. the 28th ACM International Conference on Multimedia, Oct. 2020, pp.1679–1687. DOI: https://doi.org/10.1145/3394171.3413832.
Xiao Y, Marlet R. Few-shot object detection and viewpoint estimation for objects in the wild. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.192–210. DOI: https://doi.org/10.1007/978-3-030-58520-4_12.
Sun B, Li B H, Cai S C, Yuan Y, Zhang C. FSCE: Few-shot object detection via contrastive proposal encoding. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2021, pp.7348–7358. DOI: https://doi.org/10.1109/CVPR46437.2021.00727.
Wu J X, Liu S T, Huang D, Wang Y H. Multi-scale positive sample refinement for few-shot object detection. In Proc. the 16th European Conference on Computer Vision, Aug. 2020, pp.456–472. DOI: https://doi.org/10.1007/978-3-030-58517-4_27.
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B, Belongie S. Feature pyramid networks for object detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2017, pp.936–944. DOI: https://doi.org/10.1109/CVPR.2017.106.
Ethics declarations
Conflict of Interest: The authors declare that they have no conflict of interest.
Additional information
This work was supported by the National Science and Technology Major Project under Grant No. 2020AAA0106900, the National Natural Science Foundation of China under Grant Nos. U19B2307 and 61876152, the Shaanxi Provincial Key Research and Development Program of China under Grant No. 2021KWZ-03, and the Natural Science Basic Research Program of Shaanxi Province of China under Grant No. 2021JCW-03.
Wei-Dong Lin received his B.S. degree in hydroacoustic engineering from Northwestern Polytechnical University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and object detection.
Yu-Yan Deng received his B.S. degree in computer science and technology from Xidian University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and object detection.
Yang Gao received his B.S. degree in computer science and technology from Northwestern Polytechnical University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and automated machine learning.
Ning Wang received his B.S. degree in computer science and technology from Northwestern Polytechnical University, Xi’an, in 2019. He is now a Master’s student in computer science and technology at Northwestern Polytechnical University, Xi’an. His current research interests mainly focus on computer vision and neural architecture search.
Ling-Qiao Liu received his B.S. and M.S. degrees in communication engineering from the University of Electronic Science and Technology of China, Chengdu, in 2006 and 2009, respectively, and his Ph.D. degree in computer science from the Australian National University, Canberra, in 2014. In 2016, he was awarded the Discovery Early Career Researcher Award by the Australian Research Council and a University Research Fellowship by the University of Adelaide, Adelaide. He is now a senior lecturer at the University of Adelaide and the Australian Institute for Machine Learning, Adelaide. His current research interests include low-supervision learning and various topics in computer vision and natural language processing.
Lei Zhang received his Ph.D. degree in computer science and technology from Northwestern Polytechnical University, Xi’an, in 2018. He was a research staff member in the School of Computer Science, the University of Adelaide, Adelaide, between 2017 and 2019, and a research scientist at the Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates, between 2019 and 2020. He is currently a professor with the School of Computer Science, Northwestern Polytechnical University, Xi’an. His research interests include image processing, machine learning, and video analysis.
Peng Wang received his B.S. degree in electrical engineering and automation from Beihang University, Beijing, in 2004, and his Ph.D. degree in control science and engineering from Beihang University, Beijing, in 2011. He is now a professor at the School of Computer Science, Northwestern Polytechnical University, Xi’an. He was with the School of Computer Science, the University of Adelaide, Adelaide, for about four years. His research interests include computer vision, machine learning, and artificial intelligence.
About this article
Cite this article
Lin, WD., Deng, YY., Gao, Y. et al. CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection. J. Comput. Sci. Technol. 39, 460–471 (2024). https://doi.org/10.1007/s11390-024-1743-6