DOI: 10.1145/3665348.3665376

ACTOR: Adapting CLIP for Fully Transformer-based Open-vocabulary Detection

Published: 03 July 2024

Abstract

Open-vocabulary detection (OVD) aims to identify objects from novel, unseen categories that extend beyond the base categories encountered during training. Recent approaches generally resort to large-scale Visual-Language Models (VLMs), such as CLIP, to identify novel objects. However, incorporating these models into the OVD problem faces two main challenges: (1) the degraded performance when applying a VLM pre-trained for whole-image classification to regional object recognition; (2) the difficulty of localizing objects from novel categories. To overcome these challenges, we propose ACTOR, a DETR-style model which Adapts CLIP for fully Transformer-based Open-vocabulary detection via bidiRectional prompt learning and conditional decoding. Bidirectional prompt learning tightens the alignment between CLIP's regional object features and text embeddings. Conditional decoding facilitates learning a generalizable object localizer through a class-aware matching mechanism. Our experiments on the OV-COCO benchmark demonstrate that ACTOR achieves detection performance competitive with the end-to-end open-vocabulary detector OV-DETR, while exhibiting a much faster inference speed.
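The core CLIP-based recognition step described above can be illustrated with a minimal sketch. This is not the paper's actual model: it only shows the generic CLIP-style scoring that such detectors build on, where a region feature is compared against text embeddings of candidate category names via cosine similarity and a temperature-scaled softmax. The vectors, the `temperature` value, and the toy category names are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify_region(region_feat, text_embeds, temperature=0.01):
    """Score one region feature against each class text embedding,
    CLIP-style, and return softmax probabilities over the classes."""
    logits = [cosine(region_feat, t) / temperature for t in text_embeds]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy open-vocabulary setup: embeddings for three candidate category names.
text_embeds = [
    [1.0, 0.0, 0.0, 0.0],  # "cat"
    [0.0, 1.0, 0.0, 0.0],  # "dog"
    [0.0, 0.0, 1.0, 0.0],  # "bird"
]
# A region feature most aligned with the "bird" embedding.
region = [0.05, 0.02, 0.90, 0.10]
probs = classify_region(region, text_embeds)
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # index 2 ("bird")
```

Because the candidate set is just a list of text embeddings, novel categories can be added at inference time without retraining the classifier; this is what makes the approach "open-vocabulary". The challenge the abstract highlights is that CLIP's embeddings are trained on whole images, so raw region features align poorly with text, which is what the paper's bidirectional prompt learning is designed to tighten.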

References

[1]
Maria A Bravo, Sudhanshu Mittal, and Thomas Brox. 2022. Localized vision-language matching for open-vocabulary object detection. In DAGM German Conference on Pattern Recognition. Springer, 393–408.
[2]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213–229.
[3]
Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. 2022. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14084–14093.
[4]
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2021. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021).
[5]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961–2969.
[6]
Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. 2022. F-vlm: Open-vocabulary object detection upon frozen vision and language models. arXiv preprint arXiv:2209.15639 (2022).
[7]
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2980–2988.
[8]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
[9]
Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. 2022. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329 (2022).
[10]
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[11]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
[12]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015).
[13]
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. 2019. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 658–666.
[14]
Xiaoshi Wu, Feng Zhu, Rui Zhao, and Hongsheng Li. 2023. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7031–7040.
[15]
Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. 2022. Open-vocabulary detr with conditional matching. In European Conference on Computer Vision. Springer, 106–122.
[16]
Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. 2021. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14393–14402.
[17]
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, 2022. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16793–16803.
[18]
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. 2022. Detecting twenty-thousand classes using image-level supervision. In European Conference on Computer Vision. Springer, 350–368.
[19]
Muhammad Arslan Manzoor, Sarah Albarri, Ziting Xian, Zaiqiao Meng, Preslav Nakov, and Shangsong Liang. 2023. Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications. ACM Trans. Multimedia Comput. Commun. Appl. 20, 3, Article 74 (March 2024), 34 pages. https://doi.org/10.1145/3617833.
[20]
Guanzheng Chen, Fangyu Liu, Zaiqiao Meng, and Shangsong Liang. 2022. Revisiting Parameter-Efficient Tuning: Are We Really There Yet? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2612–2626, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Published In

GAIIS '24: Proceedings of the 2024 International Conference on Generative Artificial Intelligence and Information Security
May 2024
439 pages
ISBN:9798400709562
DOI:10.1145/3665348
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited


Article Metrics

  • Total Citations: 0
  • Total Downloads: 55
  • Downloads (last 12 months): 55
  • Downloads (last 6 weeks): 5

Reflects downloads up to 17 Feb 2025
