Abstract
Vision-Language Large Models (VLMs) have recently become a primary backbone of AI, owing to their impressive performance. However, their high computation costs, i.e., low throughput and high latency, limit their potential in real-world scenarios. To accelerate VLMs, most existing methods focus on the model perspective: pruning, distillation, and quantization, while completely overlooking the redundancy on the data side. To fill this gap, this paper highlights the severity of data redundancy and designs a plug-and-play Turbo module, guided by an information degree, to prune inefficient tokens from visual or textual data. In pursuit of an efficiency-performance trade-off, the information degree takes two crucial factors into account: mutual redundancy and semantic value. Concretely, the former evaluates data duplication between sequential tokens, while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with a high information degree carry less redundancy and stronger semantics. During VLM computation, Turbo works as a user-friendly plug-in that sorts tokens by information degree and uses only the top-ranked ones to save costs. Its advantages are multifaceted, e.g., general compatibility with various VLMs across understanding and generation, and simple usage without re-training or non-trivial engineering effort. On multiple VLM benchmarks, extensive experiments demonstrate that Turbo delivers substantial acceleration with a negligible performance drop.
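To make the abstract's scoring-and-pruning idea concrete, below is a minimal sketch of how an information-degree-based token pruner could look. It is an illustrative assumption, not the paper's exact formulation: `information_degree`, `turbo_prune`, the weighting `alpha`, the use of adjacent-token cosine similarity for mutual redundancy, and [CLS]-attention as a proxy for semantic value are all hypothetical choices made here for clarity.

```python
import torch


def information_degree(tokens, cls_attn, alpha=0.5):
    """Score tokens by combining (i) mutual redundancy with neighbouring
    tokens and (ii) semantic value, approximated here by [CLS] attention.
    All names and the exact weighting are illustrative assumptions.

    tokens:   (N, D) token embeddings from a vision or text encoder layer
    cls_attn: (N,)   attention weights of the [CLS] token over the N tokens
    """
    t = torch.nn.functional.normalize(tokens, dim=-1)
    # Mutual redundancy: cosine similarity between adjacent tokens;
    # highly similar neighbours carry duplicated information.
    redundancy = (t[:-1] * t[1:]).sum(-1)
    redundancy = torch.cat([redundancy, redundancy[-1:]])  # pad back to length N
    # Semantic value: each token's contribution to the overall semantics.
    semantic = cls_attn / cls_attn.sum()
    # High information degree = low redundancy + high semantic value.
    return alpha * (1.0 - redundancy) + (1.0 - alpha) * semantic


def turbo_prune(tokens, cls_attn, keep_ratio=0.5, alpha=0.5):
    """Keep only the top-scoring tokens, acting as a plug-in between layers."""
    scores = information_degree(tokens, cls_attn, alpha)
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = scores.topk(k).indices.sort().values  # preserve original token order
    return tokens[keep_idx], keep_idx
```

In use, such a pruner would be inserted between encoder layers so that downstream attention operates on the reduced token set; the keep ratio then trades accuracy against throughput, which matches the efficiency-performance trade-off the abstract describes.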
C. Ju and H. Wang contributed equally to this work.
Acknowledgements
This work is supported by Alibaba Group through the Alibaba Research Intern Program.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ju, C. et al. (2025). Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15104. Springer, Cham. https://doi.org/10.1007/978-3-031-72952-2_25
DOI: https://doi.org/10.1007/978-3-031-72952-2_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72951-5
Online ISBN: 978-3-031-72952-2
eBook Packages: Computer Science, Computer Science (R0)