DOI:10.1145/3690407.3690424
research-article

VLP Based Open-set Object Detection with Improved RT-DETR

Published: 24 October 2024

Abstract

Traditional object detectors achieve remarkable accuracy, but they cannot detect novel categories. This paper proposes a method for open-set object detection that generates pseudo-labels with a Vision-Language Pre-trained (VLP) model. The approach enables traditional object detectors to perform open-set object detection and can be generalized to all object detectors. The paper also introduces two improvements to RT-DETR. First, RepC3 in the fusion module is replaced with Manhattan Self-Attention (MaSA) to better construct global features. Second, MPDIoU loss is used in place of GIoU loss. On the Pascal VOC07+12 dataset, the improved RT-DETR increases mAP by 3.1%, 3.8%, and 3.1% for all classes, base classes, and novel classes, respectively. Furthermore, the proposed method improves mAP for open-set object detection by 1.3% over zero-shot detection (ZSD) methods, reaching 64.6% mAP for novel classes.
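
To make the second RT-DETR change concrete, the following is a minimal sketch of the two box-regression losses as they are commonly defined in the GIoU and MPDIoU papers. It is not the implementation used in this work. The corner box format (x1, y1, x2, y2), the function names, and the image size used in the usage note are all assumptions chosen for illustration.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def _area(b: Box) -> float:
    """Area of a box; a degenerate or inverted box has area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou_and_union(a: Box, b: Box) -> Tuple[float, float]:
    """Return (IoU, union area) for two boxes."""
    inter_box = (max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3]))
    inter = _area(inter_box)
    union = _area(a) + _area(b) - inter
    iou = inter / union if union > 0 else 0.0
    return iou, union


def giou_loss(pred: Box, gt: Box) -> float:
    """1 - GIoU, where GIoU = IoU - (|C| - |union|) / |C|
    and C is the smallest box enclosing both inputs."""
    iou, union = iou_and_union(pred, gt)
    enclosing = (min(pred[0], gt[0]), min(pred[1], gt[1]),
                 max(pred[2], gt[2]), max(pred[3], gt[3]))
    c_area = _area(enclosing)
    giou = iou - (c_area - union) / c_area if c_area > 0 else iou
    return 1.0 - giou


def mpdiou_loss(pred: Box, gt: Box, img_w: float, img_h: float) -> float:
    """1 - MPDIoU, where MPDIoU = IoU - d1^2 / (w^2 + h^2) - d2^2 / (w^2 + h^2).
    d1 is the distance between the top-left corners, d2 the distance between
    the bottom-right corners, and w, h are the input image width and height."""
    iou, _ = iou_and_union(pred, gt)
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - (iou - d1_sq / norm - d2_sq / norm)
```

As a usage example under these assumptions, take a ground-truth box (0, 0, 4, 4) in a 10 x 10 image and two predictions of equal size that lie fully inside it, (0, 0, 2, 2) and (1, 1, 3, 3). Both predictions have the same IoU, and because the enclosing box equals the ground truth, `giou_loss` returns the same value for both. `mpdiou_loss` returns different values because the corner distances differ, so it still separates the two predictions.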


    Published In

    CAIBDA '24: Proceedings of the 2024 4th International Conference on Artificial Intelligence, Big Data and Algorithms
    June 2024
    1206 pages
    ISBN:9798400710247
    DOI:10.1145/3690407
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CLIP
    2. MPDIoU
    3. MaSA
    4. RT-DETR
    5. VLP
    6. ZSD
    7. open-set object detection

    Qualifiers

    • Research-article

    Conference

    CAIBDA 2024
