Abstract
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, in which class-agnostic object proposals are classified with the text encoder of a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose regional prompt learning, which aligns the textual embedding space with regional visual object features; (iii) to scale the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources via a novel self-training framework, which allows the proposed detector to be trained on a large corpus of noisy, uncurated web images; and (iv) to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations. Project page with code: https://fcjian.github.io/promptdet.
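To make the first two contributions concrete, below is a minimal PyTorch sketch of the core idea: class-agnostic region features are scored against category embeddings produced from learnable prompt vectors, while the pre-trained text encoder stays frozen. Everything here is an illustrative assumption rather than the authors' implementation: the `PromptedClassifier` name, the feature dimensions, and the toy stand-in for the frozen text encoder (a real system would feed the prompt tokens plus class-name tokens through CLIP's transformer text encoder).

```python
# Minimal sketch (assumed names/shapes) of classifying class-agnostic
# region proposals against prompt-conditioned text embeddings, in the
# spirit of regional prompt learning. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedClassifier(nn.Module):
    def __init__(self, num_classes, embed_dim=512, num_prompt_tokens=8):
        super().__init__()
        # Learnable prompt tokens shared across categories; only these
        # are optimised, the text encoder itself stays frozen.
        self.prompt_tokens = nn.Parameter(
            torch.randn(num_prompt_tokens, embed_dim) * 0.02)
        # Stand-in for frozen per-category name embeddings from a
        # pre-trained visual-language model such as CLIP.
        self.register_buffer("class_name_embed",
                             torch.randn(num_classes, embed_dim))
        # Temperature for cosine-similarity logits, as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def text_embeddings(self):
        # Toy "text encoder": fuse the prompt tokens with each class-name
        # embedding by averaging. A real system would run the full token
        # sequence through the frozen transformer text encoder.
        prompt = self.prompt_tokens.mean(dim=0, keepdim=True)
        return F.normalize(self.class_name_embed + prompt, dim=-1)

    def forward(self, region_feats):
        # region_feats: (num_proposals, embed_dim) pooled RoI features
        # from class-agnostic proposals (e.g. RPN + RoIAlign).
        region_feats = F.normalize(region_feats, dim=-1)
        text_embed = self.text_embeddings()
        # Temperature-scaled cosine-similarity logits per category.
        return self.logit_scale.exp() * region_feats @ text_embed.t()

# Usage: score 100 proposals against 1203 LVIS-style categories.
clf = PromptedClassifier(num_classes=1203)
logits = clf(torch.randn(100, 512))
print(logits.shape)  # torch.Size([100, 1203])
```

Because the category set enters only through text embeddings, swapping in embeddings for novel class names extends the detector without any box-level annotation, which is what makes the self-training stage on uncurated web images possible.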