Transformer-based few-shot object detection in traffic scenarios

Abstract

In few-shot object detection (FSOD), many approaches retrain the detector at the inference stage, which is impractical in real applications. Moreover, high-quality region proposals are difficult to generate for novel classes from a limited support set. Inspired by recent developments in visual prompt learning (VPL) and detection with transformers (DETR), an approach is proposed in which 1) class-agnostic training extends the detector to novel classes and 2) visual prompts are combined with pseudo-class embeddings to improve query generation. The proposed approach is evaluated on multiple traffic datasets, where it outperforms other mainstream approaches by a margin of 1.1% in mean average precision (mAP). In summary, an effective VPL- and DETR-based FSOD approach is proposed that requires no retraining at inference and accurately localizes novel objects through an improved query generation mechanism.
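The query-generation idea in the abstract (fusing visual prompts from the support set with pseudo-class embeddings to form the object queries of a DETR-style decoder) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name, the cross-attention fusion, and the use of pooled support features are assumptions made for illustration.

import torch
import torch.nn as nn

# Hypothetical sketch of prompt-conditioned query generation for a
# DETR-style few-shot detector; names and design details are
# illustrative, not the paper's actual API.
class VisualPromptQueryGenerator(nn.Module):
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        # Learned pseudo-class embeddings: a class-agnostic prior over
        # object queries, shared by base and novel classes.
        self.pseudo_class_embed = nn.Embedding(num_queries, d_model)
        # Projects pooled support features (the visual prompts) into
        # the decoder's query space.
        self.prompt_proj = nn.Linear(d_model, d_model)
        self.fuse = nn.MultiheadAttention(d_model, num_heads=8,
                                          batch_first=True)

    def forward(self, support_feats):
        # support_feats: (batch, num_shots, d_model) pooled features
        # from the few support images of a novel class.
        b = support_feats.size(0)
        queries = self.pseudo_class_embed.weight.unsqueeze(0).expand(b, -1, -1)
        prompts = self.prompt_proj(support_feats)
        # Cross-attend the queries to the visual prompts so each object
        # query is conditioned on the novel class's appearance.
        fused, _ = self.fuse(queries, prompts, prompts)
        return queries + fused  # residual keeps the class-agnostic prior

# Usage: replace a DETR decoder's fixed learned object queries with the
# fused queries produced here.
generator = VisualPromptQueryGenerator()
support = torch.randn(2, 5, 256)     # batch of 2 tasks, 5-shot support set
object_queries = generator(support)  # shape: (2, 100, 256)

Under this reading, extending the detector to a new class at inference reduces to encoding its support images and regenerating the queries, which is consistent with the abstract's claim that no retraining is required.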

Availability of data and materials

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

The authors would like to thank AJE (www.aje.com) for its language editing assistance during the preparation of this manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (61976188), the Special Project for Basic Business Expenses of Zhejiang Provincial Colleges and Universities (No. JRK22003), and Opening Foundation of State Key Laboratory of Virtual Reality Technology and System of Beihang University (No. VRLAB2023B02).

Author information

Contributions

Erjun Sun: Formal analysis, Writing - original draft preparation. Di Zhou: Conceptualization, Methodology, Writing - review & editing. Yan Tian: Software, Data curation, Writing - review & editing. Zhaocheng Xu: Writing - review & editing. Xun Wang: Writing - review & editing.

Corresponding author

Correspondence to Di Zhou.

Ethics declarations

Ethics and informed consent for data used

This research does not involve human participants or animals. Informed consent was obtained for all data used.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, E., Zhou, D., Tian, Y. et al. Transformer-based few-shot object detection in traffic scenarios. Appl Intell 54, 947–958 (2024). https://doi.org/10.1007/s10489-023-05245-5
