EDIR: an expert method for describing image regions based on knowledge distillation and triple fusion

Published in Applied Intelligence

Abstract

Fine-grained visual features generally require higher image input resolutions, which in turn demand a larger parameter count for general-purpose visual models to analyze them effectively. The substantial computational cost of such large models, however, poses a significant obstacle to research in this domain. To address these challenges, our research integrates descriptions of fine-grained visual information from images. We propose an Expert method for Describing Image Regions (EDIR) based on knowledge distillation and triple fusion. The method comprises a Knowledge-Distilled Expert Network (KDEN) and a Triple Information Set Fusion Network (TIFN), which combine global and regional image descriptions through controlled prompting. Unlike existing studies, our approach not only extracts global and regional image features independently but also relates their spatial information. EDIR reduces visual model parameters by a factor of 6.7 compared with CogVLM, improves ImageNet-1K zero-shot detection accuracy by 0.68%, raises the CIDEr score on NoCaps by 1.9 points, and achieves an average improvement of 1.39% in hallucination accuracy. It also raises the average inference frame rate to 32.92 FPS, a 5.82-fold improvement over the baseline.
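The abstract describes fusing independently extracted global and regional features while relating their spatial information, then passing the result to a language model through controlled prompting. The following is a minimal, hypothetical PyTorch sketch of one way such a fusion could be wired; the module name, dimensions, box embedding, and cross-attention design are illustrative assumptions, not the authors' implementation of TIFN.

```python
# Hypothetical sketch: fuse global image tokens, region-expert features, and
# region positions into one set of prompt tokens. Names and shapes are assumed.
import torch
import torch.nn as nn


class TripleFusionSketch(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=768, num_heads=8):
        super().__init__()
        # Embed normalized (x1, y1, x2, y2) boxes so each region token carries
        # its spatial relation to the global image.
        self.box_embed = nn.Sequential(
            nn.Linear(4, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, feat_dim)
        )
        # Cross-attention: spatially grounded region tokens query global tokens.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)
        self.out_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, global_tokens, region_feats, region_boxes):
        # global_tokens: (B, N_g, D) patch features from a global encoder
        # region_feats:  (B, N_r, D) features from a region-level expert encoder
        # region_boxes:  (B, N_r, 4) normalized boxes locating each region
        region_tokens = region_feats + self.box_embed(region_boxes)
        fused, _ = self.cross_attn(
            query=region_tokens, key=global_tokens, value=global_tokens
        )
        fused = self.norm(region_tokens + fused)
        # Concatenate global and region tokens as one visual prompt sequence.
        return torch.cat([global_tokens, self.out_proj(fused)], dim=1)


if __name__ == "__main__":
    B, Ng, Nr, D = 2, 196, 5, 768
    fusion = TripleFusionSketch(feat_dim=D)
    prompt = fusion(torch.randn(B, Ng, D), torch.randn(B, Nr, D), torch.rand(B, Nr, 4))
    print(prompt.shape)  # torch.Size([2, 201, 768])
```

In this sketch the box embedding supplies the spatial link between region crops and the global view, and the concatenated token sequence stands in for the visual part of a controlled prompt that a language model would consume.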




Availability of data and materials

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request. All the datasets used are publicly available. The demonstration images presented in the paper come from the MOT17 (https://motchallenge.net/data/MOT17/) and SA-1B (https://segment-anything.com/dataset/index.html) public datasets.


Acknowledgements

This work was supported by the Collaborative Innovation Key Projects of Zhengzhou (No. 123-32211645).

Author information


Contributions

All authors contributed to the study conception and design. Constructive suggestions were made by Chuanping Hu. Material preparation, data collection, and analysis were performed by Kai Ren, Hao Xi, Yongqiang Li, Jinhao Fan, and Lihua Liu. The first draft of the manuscript was written by Kai Ren, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Chuanping Hu.

Ethics declarations

Competing interests

The authors have no relevant financial interests in the manuscript and no other potential conflicts of interest to disclose.

Informed consent

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Ren, K., Hu, C., Xi, H. et al. EDIR: an expert method for describing image regions based on knowledge distillation and triple fusion. Appl Intell 55, 62 (2025). https://doi.org/10.1007/s10489-024-06027-3


Keywords