Abstract
Extracting fine-grained visual features generally requires higher input image resolutions, which in turn demand a larger parameter count for general-purpose visual models to analyze those features effectively. The substantial computational cost of such large models, however, poses a significant obstacle to research in this area. To address this, our work integrates descriptions of fine-grained visual information from images. We propose EDIR, an innovative Expert method for Describing Image Regions based on knowledge distillation and triple fusion. The method comprises a Knowledge-Distilled Expert Network (KDEN) and a Triple Information Set Fusion Network (TIFN), which together combine global and regional image descriptions through controlled prompting. Unlike existing studies, our approach not only extracts global and regional image features independently but also relates their spatial information. EDIR reduces the visual model's parameter count by a factor of 6.7 compared with CogVLM, improves ImageNet-1K zero-shot accuracy by 0.68%, raises the CIDEr score on NoCaps by 1.9 points, and achieves an average improvement of 1.39% in hallucination accuracy. It also increases the average inference frame rate to 32.92 FPS, a 5.82-fold improvement over the baseline.
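As a rough illustration of the two ingredients named above (distilling a large visual expert into a smaller student, and fusing a global image representation with region-level features while preserving their spatial relationship), the following minimal PyTorch sketch shows a generic soft-target distillation loss and a naive token-level fusion that appends normalized box coordinates. All function names, tensor shapes, and the fusion scheme are assumptions made for illustration; this is not the paper's actual KDEN or TIFN implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Generic soft-target distillation: pull a small student encoder's
    # outputs toward those of a large frozen teacher.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def fuse_global_and_regions(global_emb: torch.Tensor,    # (B, D)
                            region_embs: torch.Tensor,   # (B, R, D)
                            region_boxes: torch.Tensor   # (B, R, 4), xyxy in [0, 1]
                            ) -> torch.Tensor:
    # Concatenate a global image token with region tokens, appending each
    # token's normalized box so spatial relations are preserved.
    b = global_emb.shape[0]
    full_box = torch.tensor([0.0, 0.0, 1.0, 1.0],
                            device=global_emb.device).expand(b, 1, 4)
    global_tok = torch.cat([global_emb.unsqueeze(1), full_box], dim=-1)  # (B, 1, D + 4)
    region_tok = torch.cat([region_embs, region_boxes], dim=-1)          # (B, R, D + 4)
    return torch.cat([global_tok, region_tok], dim=1)                    # (B, R + 1, D + 4)

# Toy usage with random features standing in for encoder outputs.
g = torch.randn(2, 256)          # global embedding per image
regs = torch.randn(2, 3, 256)    # three region embeddings per image
boxes = torch.rand(2, 3, 4)      # normalized region boxes
tokens = fuse_global_and_regions(g, regs, boxes)   # shape (2, 4, 260)
kd = distillation_loss(torch.randn(2, 256), torch.randn(2, 256))

Appending the full-image box [0, 0, 1, 1] to the global token and each region's box to its own token is one simple way to let a downstream module relate regions spatially; the controlled-prompting fusion described in the paper is more elaborate.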
Availability of data and materials
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request. All the datasets used are publicly available. The demonstration images presented in the paper come from the MOT17 (https://motchallenge.net/data/MOT17/) and SA-1B (https://segment-anything.com/dataset/index.html) public datasets.
References
Alayrac J-B, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M et al (2022) Flamingo: a visual language model for few-shot learning. Adv Neural Inf Process Syst 35:23716–23736
Bai J, Bai S, Yang S, Wang S, Tan S, Wang P, Lin J, Zhou C, Zhou J (2023) Qwen-vl: a frontier large vision-language model with versatile abilities. arXiv:2308.12966
Chen K, Zhang Z, Zeng W, Zhang R, Zhu F, Zhao R (2023) Shikra: unleashing multimodal llm's referential dialogue magic
Chen S, Zhu H, Chen X, Lei Y, Yu G, Chen T (2023) End-to-end 3d dense captioning with vote2cap-detr. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11124–11133
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR, pp1597–1607
Chen X, Xie S, He K (2021) An empirical study of training self-supervised vision transformers. In: 2021 IEEE/CVF international conference on computer vision (ICCV), pp 9620–9629
Chen X, Djolonga J, Padlewski P, Mustafa B, Changpinyo S, Wu J, Ruiz CR, Goodman S, Wang X, Tay Y et al (2023) Pali-x: on scaling up a multilingual vision and language model. arXiv:2305.18565
Chen X, Wang X, Changpinyo S, Piergiovanni AJ, Padlewski P, Salz D, Goodman S, Grycner A, Mustafa B, Beyer L et al (2022) Pali: a jointly-scaled multilingual language-image model. arXiv:2209.06794
Chen X, Zhao Z, Zhang Y, Duan M, Qi D, Zhao H (2022) Focalclick: towards practical interactive image segmentation. In: 2022 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 1290–1299
Cornia M, Baraldi L, Fiameni G, Cucchiara R (2024) Generating more pertinent captions by leveraging semantics and style on multi-source datasets. Int J Comput Vis 132(5):1701–1720
Dai W, Li J, Li D, Tiong A, Zhao J, Wang W, Li B, Fung PN, Hoi S (2023) Instructblip: towards general-purpose vision-language models with instruction tuning. Adv Neural Inf Process Syst 36
Fang Y, Wang W, Xie B, Sun Q, Wu L, Wang X, Huang T, Wang X, Cao Y (2023) Eva: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19358–19369
Fang Z, Wang J, Hu X, Liang L, Gan Z, Wang L, Yang Y, Liu Z (2022) Injecting semantic concepts into end-to-end image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18009–18019
Ghandi T, Pourreza H, Mahyar H (2023) Deep learning approaches on image captioning: a review. ACM Comput Surv 56(3):1–39
He K, Chen X, Xie S, Li Y, Dollár P, Girshick R (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16000–16009
Hu X, Gan Z, Wang J, Yang Z, Liu Z, Lu Y, Wang L (2022) Scaling up vision-language pre-training for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17980–17989
Huang X, Wang J, Tang Y, Zhang Z, Hu H, Lu J, Wang L, Liu Z (2024) Segment and caption anything. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13405–13417
Jain J, Li J, Chiu MT, Hassani A, Orlov N, Shi H (2023) Oneformer: one transformer to rule universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2989–2998
Jia C, Yang Y, Xia Y, Chen Y-T, Parekh Z, Pham H, Le Q, Sung Y-H, Li Z, Duerig T (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. PMLR, pp 4904–4916
Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X (2020) In defense of grid features for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10267–10276
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y, Dollar P, Girshick R (2023) Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 4015–4026
Li J, Li D, Savarese S, Hoi S (2023) Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. PMLR, pp 19730–19742
Li J, Li D, Xiong C, Hoi S (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. PMLR, pp 12888–12900
Li Y, Fan H, Hu R, Feichtenhofer C, He K (2023) Scaling language-image pre-training via masking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23390–23400
Li Y, Du Y, Zhou K, Wang J, Zhao X, Wen J-R (2023) Evaluating object hallucination in large vision-language models. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 conference on empirical methods in natural language processing. Singapore, pp 292–305. Association for Computational Linguistics
Liu H, Li C, Wu Q, Lee YJ (2024) Visual instruction tuning. Adv Neural Inf Process Syst 36
Liu Q, Xu Z, Bertasius G, Niethammer M (2023) Simpleclick: interactive image segmentation with simple vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 22290–22300
Long Y, Wen Y, Han J, Xu H, Ren P, Zhang W, Zhao S, Liang X (2023) Capdet: unifying dense captioning and open-world detection pretraining. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15233–15243
Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(7):3523–3542
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: International conference on machine learning. PMLR, pp 8821–8831
Ren K, Hu C, Xi H (2024) Rlm-tracking: online multi-pedestrian tracking supported by relative location mapping. Int J Mach Learn Cybern 1–17
Ren S, Wei F, Zhang Z, Hu H (2023) Tinymim: an empirical study of distilling mim pre-trained models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3687–3697
Ridnik T, Ben-Baruch E, Noy A, Zelnik L (2021) Imagenet-21k pretraining for the masses. In: Vanschoren J, Yeung S (eds) Proceedings of the neural information processing systems track on datasets and benchmarks, vol 1
Shao Z, Han J, Debattista K, Pang Y (2023) Textual context-aware dense captioning with diverse words. IEEE Trans Multimedia
Sun Q, Fang Y, Wu L, Wang X, Cao Y (2023) Eva-clip: Improved training techniques for clip at scale. arXiv:2303.15389
Sun Z, Fang Y, Wu T, Zhang P, Zang Y, Kong S, Xiong Y, Lin D, Wang J (2024) Alpha-clip: a clip model focusing on wherever you want. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13019–13029
Wang J, Yang Z, Hu X, Li X, Lin K, Gan Z, Liu Z, Liu C, Wang L (2022) Git: a generative image-to-text transformer for vision and language
Wang T, Zhang J, Fei J, Ge Y, Zheng H, Tang Y, Li Z, Gao M, Zhao S, Shan Y et al (2023) Caption anything: interactive image description with diverse multimodal controls. arXiv:2305.02677
Wang W, Lv Q, Yu W, Hong W, Qi J, Wang Y, Ji J, Yang Z, Zhao L, Song X et al (2023) Cogvlm: visual expert for pretrained language models. arXiv:2311.03079
Wang Z, Yu J, Yu AW, Dai Z, Tsvetkov Y, Cao Y (2021) Simvlm: simple visual language model pretraining with weak supervision. arXiv:2108.10904
Wu K, Peng H, Zhou Z, Xiao B, Liu M, Yuan L, Xuan H, Valenzuela M, Chen XS, Wang X, Chao H (2023) Tinyclip: clip distillation via affinity mimicking and weight inheritance
Wu K, Zhang J, Peng H, Liu M, Xiao B, Fu J, Yuan L (2022) Tinyvit: fast pretraining distillation for small vision transformers. In: European conference on computer vision. Springer, pp 68–85
Yu J, Wang Z, Vasudevan V, Yeung L, Seyedhosseini M, Wu Y (2022) Coca: contrastive captioners are image-text foundation models. arXiv:2205.01917
Zhang A, Yao Y, Ji W, Liu Z, Chua T-S (2023) Next-chat: an lmm for chat, detection and segmentation
Zhang C, Han D, Qiao Y, Kim JU, Bae S-H, Lee S, Hong CS (2023) Faster segment anything: towards lightweight sam for mobile applications
Zhang F, Xu M, Xu C (2022) Tell, imagine, and search: end-to-end learning for composing text and image to image retrieval. ACM Trans Multimed Comput Commun Appl (TOMM) 18(2):1–23
Zhou Y, Zhang R, Chen C, Li C, Tensmeyer C, Yu T, Gu J, Xu J, Sun T (2022) Towards language-free training for text-to-image generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17907–17917
Zhu D, Chen J, Shen X, Li X, Elhoseiny M (2023) Minigpt-4: enhancing vision-language understanding with advanced large language models. arXiv:2304.10592
Zou X, Yang J, Zhang H, Li F, Li L, Wang J, Wang L, Gao J, Lee YJ (2024) Segment everything everywhere all at once. Adv Neural Inf Process Syst 36
Acknowledgements
This work was supported by the Collaborative Innovation Key Projects of Zhengzhou (No. 123-32211645).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Constructive suggestions were made by Chuanping Hu. Material preparation, data collection and analysis were performed by Kai Ren, Hao Xi, Yongqiang Li, Jinhao Fan and Lihua Liu. The first draft of the manuscript was written by Kai Ren and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors have no relevant financial interests in the manuscript and no other potential conflicts of interest to disclose.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ren, K., Hu, C., Xi, H. et al. EDIR: an expert method for describing image regions based on knowledge distillation and triple fusion. Appl Intell 55, 62 (2025). https://doi.org/10.1007/s10489-024-06027-3