DOI: 10.1145/3664647.3681212
research-article
Open access

Uni-YOLO: Vision-Language Model-Guided YOLO for Robust and Fast Universal Detection in the Open World

Published: 28 October 2024

Abstract

Universal object detectors aim to detect any object in any scene without human annotation, exhibiting superior generalization. However, the current universal object detectors show degraded performance in harsh weather, and their insufficient real-time capabilities limit their application. In this paper, we present Uni-YOLO, a universal detector designed for complex scenes with real-time performance. Uni-YOLO is a one-stage object detector that uses general object confidence to distinguish between objects and backgrounds, and employs a grid cell regression method for real-time detection. To improve its robustness in harsh weather conditions, the input of Uni-YOLO is adaptively enhanced with a physical model-based enhancement module. During training and inference, Uni-YOLO is guided by the extensive knowledge of the vision-language model CLIP. An object augmentation method is proposed to improve generalization in training by utilizing multiple source datasets with heterogeneous annotations. Furthermore, an online self-enhancement method is proposed to allow Uni-YOLO to further focus on specific objects through self-supervised fine-tuning in a given scene. Extensive experiments on public benchmarks and a UAV deployment are conducted to validate its superiority and practical value.
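The abstract describes a class-agnostic "general object confidence" combined with guidance from CLIP. The sketch below illustrates one common way such a combination can work: candidate boxes are first filtered by objectness, then labelled by cosine similarity between a region embedding and per-class text embeddings. It is not the paper's implementation. The function names, the threshold, the softmax temperature, and the toy 3-dimensional vectors standing in for CLIP embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(xs, temperature=0.01):
    """Numerically stable softmax; the temperature value is an assumption."""
    scaled = [x / temperature for x in xs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def label_boxes(boxes, text_embeddings, objectness_threshold=0.5):
    """Keep boxes whose class-agnostic objectness passes the threshold,
    then assign each the class whose text embedding is most similar.

    boxes: list of dicts with keys "xyxy", "objectness", "embedding"
           (hypothetical layout, not taken from the paper).
    text_embeddings: dict mapping class name to an embedding vector.
    """
    names = list(text_embeddings)
    results = []
    for box in boxes:
        if box["objectness"] < objectness_threshold:
            continue  # treated as background
        sims = [cosine(box["embedding"], text_embeddings[n]) for n in names]
        probs = softmax(sims)
        best = max(range(len(names)), key=probs.__getitem__)
        results.append({
            "xyxy": box["xyxy"],
            "label": names[best],
            # Final score: objectness times class probability (an assumption).
            "score": box["objectness"] * probs[best],
        })
    return results


# Toy vectors standing in for text and region embeddings.
text_embeddings = {"car": [1.0, 0.0, 0.0], "person": [0.0, 1.0, 0.0]}
boxes = [
    {"xyxy": (10, 10, 50, 40), "objectness": 0.9, "embedding": [0.9, 0.1, 0.0]},
    {"xyxy": (60, 20, 80, 70), "objectness": 0.2, "embedding": [0.1, 0.9, 0.0]},
]
detections = label_boxes(boxes, text_embeddings)
```

Because the class list is just a dictionary of text embeddings, new categories can be added at inference time without retraining the box predictor, which is the general appeal of pairing a class-agnostic detector with a vision-language model.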

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. clip
    2. object detection
    3. vision-language model
    4. zero-shot learning

    Qualifiers

    • Research-article

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
