Abstract
We present DetToolChain, a novel prompting paradigm that unleashes the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new chain-of-thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measurement standards (e.g., overlaying rulers and compasses), and infer from contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought automatically decomposes the task into simple subtasks, diagnoses the predictions, and plans progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared with existing state-of-the-art methods, GPT-4V with our DetToolChain improves performance by +21.5% \(AP_{50}\) on the MS COCO Novel class set for open-vocabulary detection, +24.23% accuracy on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the D-cube described object detection FULL setting. The code will be released upon acceptance.
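To make the abstract's pipeline concrete, the following Python sketch shows, under loose assumptions, how a visual-prompting and coarse-to-fine box-refinement loop of this kind could be wired together. The helpers query_mllm, overlay_ruler, and zoom_in are hypothetical stand-ins for the toolkit and the MLLM API, not the authors' implementation; the MLLM call is stubbed so the sketch runs offline.

```python
# Hypothetical sketch of a "detection chain-of-thought" loop: overlay a ruler
# as a measurement standard, query the MLLM, then diagnose and refine the box
# by zooming into the current estimate. All helper names are illustrative.
from PIL import Image, ImageDraw


def query_mllm(prompt: str, image: Image.Image) -> dict:
    """Stub for an MLLM call (e.g., GPT-4V). Returns a dummy box so the
    sketch runs without an API; a real system would call a vision API here."""
    w, h = image.size
    return {"box": [0.25 * w, 0.25 * h, 0.75 * w, 0.75 * h], "confident": False}


def overlay_ruler(image: Image.Image, ticks: int = 10) -> Image.Image:
    """Visual prompt: draw tick marks along the image borders so the model
    can read coordinates against a measurement standard."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for i in range(ticks + 1):
        x, y = w * i / ticks, h * i / ticks
        draw.line([(x, 0), (x, 10)], fill="red", width=2)  # top ruler ticks
        draw.line([(0, y), (10, y)], fill="red", width=2)  # left ruler ticks
    return img


def zoom_in(image: Image.Image, box: list, margin: float = 0.2) -> Image.Image:
    """Visual prompt: crop around the current box estimate (with a margin)
    so the model focuses on regional information."""
    x0, y0, x1, y1 = box
    mw, mh = margin * (x1 - x0), margin * (y1 - y0)
    return image.crop((int(max(x0 - mw, 0)), int(max(y0 - mh, 0)),
                       int(min(x1 + mw, image.width)), int(min(y1 + mh, image.height))))


def detect(image: Image.Image, target: str, max_rounds: int = 3) -> list:
    """Coarse-to-fine loop: predict a box, diagnose it, and progressively refine."""
    prompted = overlay_ruler(image)
    result = query_mllm(f"Locate the {target}; read coordinates from the rulers.", prompted)
    box = result["box"]
    for _ in range(max_rounds):
        if result.get("confident"):
            break  # diagnosis step judged the prediction good enough
        region = zoom_in(image, box)
        result = query_mllm(f"Refine the box for the {target} in this zoomed view.",
                            overlay_ruler(region))
        box = result["box"]  # a real system would map back to full-image coordinates
    return box


if __name__ == "__main__":
    canvas = Image.new("RGB", (640, 480), "white")
    print(detect(canvas, "dog"))
```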
Y. Wu and Y. Wang—Equal Contributions.
Acknowledgments
This research was partially supported by the National Natural Science Foundation of China under grants No. 62176231, No. 62106218, No. 82202984, No. 92259202, and No. 62132017, and by the Zhejiang Key R&D Program of China under grant No. 2023C03053. This work was also supported by Shanghai Artificial Intelligence Laboratory and by the JC STEM Lab of AI for Science and Engineering, funded by The Hong Kong Jockey Club Charities Trust.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, Y. et al. (2025). DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15090. Springer, Cham. https://doi.org/10.1007/978-3-031-73411-3_10
DOI: https://doi.org/10.1007/978-3-031-73411-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73410-6
Online ISBN: 978-3-031-73411-3
eBook Packages: Computer Science, Computer Science (R0)