Abstract
Extracting fine-grained visual features generally requires higher input image resolutions, which in turn demand a larger parameter count for general-purpose visual models to analyze those features effectively. The substantial computational cost of such large models, however, poses a significant obstacle to research in this area. To address this, our work integrates descriptions of fine-grained visual information from images. We propose EDIR, an innovative Expert method for Describing Image Regions based on knowledge distillation and triple fusion. The method comprises a Knowledge-Distilled Expert Network (KDEN) and a Triple Information Set Fusion Network (TIFN), which together combine global and regional image descriptions through controlled prompting. Unlike existing studies, our approach not only extracts global and regional image features independently but also relates their spatial information. EDIR reduces the visual model's parameter count by a factor of 6.7 compared with CogVLM, improves ImageNet-1K zero-shot accuracy by 0.68%, raises the CIDEr score on NoCaps by 1.9 points, and achieves an average improvement of 1.39% in hallucination accuracy. It also increases the average inference frame rate to 32.92 FPS, a 5.82-fold improvement over the baseline.
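As a rough illustration of the two ingredients named above (distilling a large visual expert into a smaller student, and fusing a global image representation with region-level features while preserving their spatial relationship), the following minimal PyTorch sketch shows a generic soft-target distillation loss and a naive token-level fusion that appends normalized box coordinates. All function names, tensor shapes, and the fusion scheme are assumptions made for illustration; this is not the paper's actual KDEN or TIFN implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Generic soft-target distillation: pull a small student encoder's
    # outputs toward those of a large frozen teacher.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def fuse_global_and_regions(global_emb: torch.Tensor,    # (B, D)
                            region_embs: torch.Tensor,   # (B, R, D)
                            region_boxes: torch.Tensor   # (B, R, 4), xyxy in [0, 1]
                            ) -> torch.Tensor:
    # Concatenate a global image token with region tokens, appending each
    # token's normalized box so spatial relations are preserved.
    b = global_emb.shape[0]
    full_box = torch.tensor([0.0, 0.0, 1.0, 1.0],
                            device=global_emb.device).expand(b, 1, 4)
    global_tok = torch.cat([global_emb.unsqueeze(1), full_box], dim=-1)  # (B, 1, D + 4)
    region_tok = torch.cat([region_embs, region_boxes], dim=-1)          # (B, R, D + 4)
    return torch.cat([global_tok, region_tok], dim=1)                    # (B, R + 1, D + 4)

# Toy usage with random features standing in for encoder outputs.
g = torch.randn(2, 256)          # global embedding per image
regs = torch.randn(2, 3, 256)    # three region embeddings per image
boxes = torch.rand(2, 3, 4)      # normalized region boxes
tokens = fuse_global_and_regions(g, regs, boxes)   # shape (2, 4, 260)
kd = distillation_loss(torch.randn(2, 256), torch.randn(2, 256))

Appending the full-image box [0, 0, 1, 1] to the global token and each region's box to its own token is one simple way to let a downstream module relate regions spatially; the controlled-prompting fusion described in the paper is more elaborate.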
Availability of data and materials
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request. All the datasets used are publicly available. The demonstration images presented in the paper come from the MOT17 (https://motchallenge.net/data/MOT17/) and SA-1B (https://segment-anything.com/dataset/index.html) public datasets.
References
Alayrac J-B, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M et al (2022) Flamingo: a visual language model for few-shot learning. Adv Neural Inf Process Syst 35:23716–23736
Bai J, Bai S, Yang S, Wang S, Tan S, Wang P, Lin J, Zhou C, Zhou J (2023) Qwen-vl: a frontier large vision-language model with versatile abilities. arXiv:2308.12966
Chen K, Zhang Z, Zeng W, Zhang R, Zhu F, Zhao R (2023) Shikra: unleashing multimodal llm's referential dialogue magic
Chen S, Zhu H, Chen X, Lei Y, Yu G, Chen T (2023) End-to-end 3d dense captioning with vote2cap-detr. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11124–11133
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR, pp1597–1607
Chen X, Xie S, He K (2021) An empirical study of training self-supervised vision transformers. In: 2021 IEEE/CVF international conference on computer vision (ICCV), pp 9620–9629
Chen X, Djolonga J, Padlewski P, Mustafa B, Changpinyo S, Wu J, Ruiz CR, Goodman S, Wang X, Tay Y et al (2023) Pali-x: on scaling up a multilingual vision and language model. arXiv:2305.18565
Chen X, Wang X, Changpinyo S, Piergiovanni AJ, Padlewski P, Salz D, Goodman S, Grycner A, Mustafa B, Beyer L et al (2022) Pali: a jointly-scaled multilingual language-image model. arXiv:2209.06794
Chen X, Zhao Z, Zhang Y, Duan M, Qi D, Zhao H (2022) Focalclick: towards practical interactive image segmentation. In: 2022 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 1290–1299
Cornia M, Baraldi L, Fiameni G, Cucchiara R (2024) Generating more pertinent captions by leveraging semantics and style on multi-source datasets. Int J Comput Vis 132(5):1701–1720
Dai W, Li J, Li D, Tiong A, Zhao J, Wang W, Li B, Fung PN, Hoi S (2023) Instructblip: towards general-purpose vision-language models with instruction tuning. Adv Neural Inf Process Syst 36
Fang Y, Wang W, Xie B, Sun Q, Wu L, Wang X, Huang T, Wang X, Cao Y (2023) Eva: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19358–19369
Fang Z, Wang J, Hu X, Liang L, Gan Z, Wang L, Yang Y, Liu Z (2022) Injecting semantic concepts into end-to-end image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 18009–18019
Ghandi T, Pourreza H, Mahyar H (2023) Deep learning approaches on image captioning: a review. ACM Comput Surv 56(3):1–39
He K, Chen X, Xie S, Li Y, Dollár P, Girshick R (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16000–16009
Hu X, Gan Z, Wang J, Yang Z, Liu Z, Lu Y, Wang L (2022) Scaling up vision-language pre-training for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17980–17989
Huang X, Wang J, Tang Y, Zhang Z, Hu H, Lu J, Wang L, Liu Z (2024) Segment and caption anything. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13405–13417
Jain J, Li J, Chiu MT, Hassani A, Orlov N, Shi H (2023) Oneformer: one transformer to rule universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2989–2998
Jia C, Yang Y, Xia Y, Chen Y-T, Parekh Z, Pham H, Le Q, Sung Y-H, Li Z, Duerig T (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. PMLR, pp 4904–4916
Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X (2020) In defense of grid features for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10267–10276
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y, Dollar P, Girshick R (2023) Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 4015–4026
Li J, Li D, Savarese S, Hoi S (2023) Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. PMLR, pp 19730–19742
Li J, Li D, Xiong C, Hoi S (2022) Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. PMLR, pp 12888–12900
Li Y, Fan H, Hu R, Feichtenhofer C, He K (2023) Scaling language-image pre-training via masking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 23390–23400
Li Y, Du Y, Zhou K, Wang J, Zhao X, Wen J-R (2023) Evaluating object hallucination in large vision-language models. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 conference on empirical methods in natural language processing. Singapore, pp 292–305. Association for Computational Linguistics
Liu H, Li C, Wu Q, Lee YJ (2024) Visual instruction tuning. Adv Neural Inf Process Syst 36
Liu Q, Xu Z, Bertasius G, Niethammer M (2023) Simpleclick: interactive image segmentation with simple vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 22290–22300
Long Y, Wen Y, Han J, Xu H, Ren P, Zhang W, Zhao S, Liang X (2023) Capdet: unifying dense captioning and open-world detection pretraining. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15233–15243
Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(7):3523–3542
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: International conference on machine learning. PMLR, pp 8821–8831
Ren K, Hu C, Xi H (2024) Rlm-tracking: online multi-pedestrian tracking supported by relative location mapping. Int J Mach Learn Cybern 1–17
Ren S, Wei F, Zhang Z, Hu H (2023) Tinymim: an empirical study of distilling mim pre-trained models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3687–3697
Ridnik T, Ben-Baruch E, Noy A, Zelnik L (2021) Imagenet-21k pretraining for the masses. In: Vanschoren J, Yeung S (eds) Proceedings of the neural information processing systems track on datasets and benchmarks, vol 1
Shao Z, Han J, Debattista K, Pang Y (2023) Textual context-aware dense captioning with diverse words. IEEE Trans Multimedia
Sun Q, Fang Y, Wu L, Wang X, Cao Y (2023) Eva-clip: Improved training techniques for clip at scale. arXiv:2303.15389
Sun Z, Fang Y, Wu T, Zhang P, Zang Y, Kong S, Xiong Y, Lin D, Wang J (2024) Alpha-clip: a clip model focusing on wherever you want. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13019–13029
Wang J, Yang Z, Hu X, Li X, Lin K, Gan Z, Liu Z, Liu C, Wang L (2022) Git: a generative image-to-text transformer for vision and language
Wang T, Zhang J, Fei J, Ge Y, Zheng H, Tang Y, Li Z, Gao M, Zhao S, Shan Y et al (2023) Caption anything: interactive image description with diverse multimodal controls. arXiv:2305.02677
Wang W, Lv Q, Yu W, Hong W, Qi J, Wang Y, Ji J, Yang Z, Zhao L, Song X et al (2023) Cogvlm: visual expert for pretrained language models. arXiv:2311.03079
Wang Z, Yu J, Yu AW, Dai Z, Tsvetkov Y, Cao Y (2021) Simvlm: simple visual language model pretraining with weak supervision. arXiv:2108.10904
Wu K, Peng H, Zhou Z, Xiao B, Liu M, Yuan L, Xuan H, Valenzuela M, Chen XS, Wang X, Chao H (2023) Tinyclip: clip distillation via affinity mimicking and weight inheritance
Wu K, Zhang J, Peng H, Liu M, Xiao B, Fu J, Yuan L (2022) Tinyvit: fast pretraining distillation for small vision transformers. In: European conference on computer vision. Springer, pp 68–85
Yu J, Wang Z, Vasudevan V, Yeung L, Seyedhosseini M, Wu Y (2022) Coca: contrastive captioners are image-text foundation models. arXiv:2205.01917
Zhang A, Yao Y, Ji W, Liu Z, Chua T-S (2023) Next-chat: an lmm for chat, detection and segmentation
Zhang C, Han D, Qiao Y, Kim JU, Bae S-H, Lee S, Hong CS (2023) Faster segment anything: towards lightweight sam for mobile applications
Zhang F, Xu M, Xu C (2022) Tell, imagine, and search: end-to-end learning for composing text and image to image retrieval. ACM Trans Multimed Comput Commun Appl (TOMM) 18(2):1–23
Zhou Y, Zhang R, Chen C, Li C, Tensmeyer C, Yu T, Gu J, Xu J, Sun T (2022) Towards language-free training for text-to-image generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17907–17917
Zhu D, Chen J, Shen X, Li X, Elhoseiny M (2023) Minigpt-4: enhancing vision-language understanding with advanced large language models. arXiv:2304.10592
Zou X, Yang J, Zhang H, Li F, Li L, Wang J, Wang L, Gao J, Lee YJ (2024) Segment everything everywhere all at once. Adv Neural Inf Process Syst 36
Acknowledgements
This work was supported by the Collaborative Innovation Key Projects of Zhengzhou (No. 123-32211645).
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Constructive suggestions were made by Chuanping Hu. Material preparation, data collection and analysis were performed by Kai Ren, Hao Xi, Yongqiang Li, Jinhao Fan and Lihua Liu. The first draft of the manuscript was written by Kai Ren and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors have no relevant financial interests in the manuscript and no other potential conflicts of interest to disclose.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ren, K., Hu, C., Xi, H. et al. EDIR: an expert method for describing image regions based on knowledge distillation and triple fusion. Appl Intell 55, 62 (2025). https://doi.org/10.1007/s10489-024-06027-3