Resizing codebook of vector quantization without retraining

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Large models pre-trained on massive data have become a flourishing paradigm of artificial intelligence systems. Recent works, such as M6, CogView, WenLan 2.0, NÜWA, and ERNIE-ViLG, further extend this paradigm to joint Vision-Language Pre-training (VLP). For VLP, the two-stage architecture is a popular design: the first stage learns an encoding function for the data, and the second stage learns a probabilistic model over the encoded representations. Vector quantization (VQ) is commonly used as the encoding function for image data in the first stage. VQ comprises a data structure (the codebook) and an algorithm (nearest-neighbor quantization). Publicly available VQ models (e.g., VQGAN, VQVAE, VQVAE2) ship with a codebook whose size is assigned empirically (e.g., 1024, 4096, or 16,384) by their authors. If we want a smaller codebook to lower the computation load of the VQ process, or a larger codebook for better reconstruction quality, we have to retrain the VQ models, which consist of a down-sampling net, the codebook, and an up-sampling net. However, retraining VQ models is very expensive, since these models, with billions of parameters, are trained on massive datasets. This motivates us to find an approach to resize the codebook of vector quantization without retraining. In this paper, we leverage hyperbolic embeddings to enrich codebook vectors with co-occurrence information and reorder the enhanced codebook by the Hilbert curve. We can then resize the codebook of vector quantization for lower computation load or better reconstruction quality. Experimental results demonstrate the efficiency and effectiveness of our approach compared with competitive baselines. The code will be released to the public.
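To make the abstract's terminology concrete, below is a minimal NumPy sketch (our illustration, not the authors' released code) of the two operations it refers to: nearest-neighbor quantization of encoder outputs against a codebook, and shrinking a codebook that has already been reordered along a locality-preserving index. The function names, the evenly spaced subsampling rule, and the abstracted `order` permutation (which in the paper would come from a Hilbert-curve index over the hyperbolically enhanced codebook) are illustrative assumptions.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbor vector quantization.

    z:        (n, d) encoder outputs (e.g., feature vectors of image patches)
    codebook: (K, d) codebook vectors
    returns:  (n,) code indices and (n, d) quantized vectors
    """
    # Squared Euclidean distances via ||z||^2 - 2 z.c + ||c||^2, shape (n, K).
    d2 = (
        (z ** 2).sum(axis=1, keepdims=True)
        - 2.0 * z @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def resize_codebook(codebook, order, new_size):
    """Shrink a codebook without retraining (illustrative sketch).

    `order` is a permutation of range(K) that sorts entries along a
    locality-preserving index (in the paper, a Hilbert-curve index of the
    hyperbolically enhanced codebook), so that neighboring entries are
    similar. Keeping entries spaced evenly along that index is one simple
    way to retain coverage at a smaller size.
    """
    reordered = codebook[order]
    keep = np.linspace(0, len(codebook) - 1, new_size).round().astype(int)
    return reordered[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(1024, 256))  # a VQGAN-scale codebook
    z = rng.normal(size=(16, 256))           # encoder outputs for 16 patches
    order = np.arange(1024)                  # placeholder for a Hilbert ordering
    small = resize_codebook(codebook, order, 256)
    idx, z_q = quantize(z, small)
    print(idx.shape, z_q.shape)              # (16,) (16, 256)
```

Growing the codebook would analogously insert new entries between curve neighbors (e.g., by interpolation); the paper's exact rule may differ, so this sketch covers only the shrinking direction.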


Availability of supporting data

All datasets and model weights used in our experiments are publicly available, with references or URLs provided. Our proposed approaches are described in detail with pseudocode or figures. In addition, we will release all code as an open-source project upon acceptance of this paper.

Notes

  1. https://github.com/CompVis/taming-transformers/blob/master/taming/modules/vqvae/quantize.py, https://github.com/MishaLaskin/vqvae/blob/master/models/quantizer.py.

  2. https://github.com/CompVis/taming-transformers.

  3. https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/CogView.

References

  1. Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: dataset and study. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 1122–1131 (2017)

  2. Ai, L., Yu, J., Wu, Z., et al.: Optimized residual vector quantization for efficient approximate nearest neighbor search. Multimed. Syst. 23(2), 169–181 (2017)

  3. Bai, Y., Ying, Z., Ren, H., et al.: Modeling heterogeneous hierarchies with relation-specific hyperbolic cones. In: Advances in Neural Information Processing Systems, pp. 12316–12327 (2021)

  4. Chen, H., Chang, Y.: All-nearest-neighbors finding based on the Hilbert curve. Expert Syst. Appl. 38(6), 7462–7475 (2011)

  5. Devlin, J., Chang, M., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019)

  6. Ding, M., Kong, K., Li, J., et al.: VQ-GNN: a universal framework to scale up graph neural networks using vector quantization. arXiv preprint arXiv:2110.14363 (2021)

  7. Ding, M., Yang, Z., et al.: CogView: mastering text-to-image generation via transformers. CoRR arXiv:2105.13290 (2021)

  8. Esser, P., Rombach, R., et al.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883 (2021)

  9. Fei, N., Lu, Z., et al.: WenLan 2.0: make AI imagine via a multimodal foundation model. CoRR arXiv:2110.14378 (2021)

  10. Fu, C., Xiang, C., Wang, C., et al.: Fast approximate nearest neighbor search with the navigating spreading-out graph. Proc. VLDB Endow. 12(5), 461–474 (2019)

  11. Ge, T., He, K., Ke, Q., et al.: Optimized product quantization for approximate nearest neighbor search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953 (2013)

  12. Guo, R., Sun, P., Lindgren, E., et al.: Accelerating large-scale inference with anisotropic vector quantization. In: International Conference on Machine Learning, vol. 119, PMLR, pp. 3887–3896 (2020)

  13. He, K., Wen, F., Sun, J.: K-means hashing: an affinity-preserving quantization method for learning binary compact codes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2938–2945 (2013)

  14. Heusel, M., Ramsauer, H., Unterthiner, T., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)

  15. Ignatov, A., Timofte, R., et al.: PIRM challenge on perceptual image enhancement on smartphones: report. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 315–333 (2018)

  16. Kalantidis, Y., Avrithis, Y.: Locally optimized product quantization for approximate nearest neighbor search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329–2336 (2014)

  17. Karras, T., Aila, T., et al.: Progressive growing of GANs for improved quality, stability, and variation. In: ICLR (2018)

  18. Khrulkov, V., Mirvakhabova, L., et al.: Hyperbolic image embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6417–6427 (2020)

  19. Kitaev, N., Kaiser, L., et al.: Reformer: the efficient transformer. In: ICLR (2020)

  20. Lin, J., Men, R., et al.: M6: a Chinese multimodal pretrainer. CoRR arXiv:2103.00823 (2021)

  21. Lin, T., Maire, M., et al.: Microsoft COCO: common objects in context. In: Proceedings of the 13th European Conference on Computer Vision (ECCV), pp. 740–755 (2014)

  22. Liu, Z., Luo, P., et al.: Deep learning face attributes in the wild. In: Proceedings of the IEEE international conference on computer vision, pp. 3730–3738 (2015)

  23. Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 824–836 (2020)

  24. Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. In: Advances in neural information processing systems, pp. 6338–6347 (2017)

  25. van den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: Advances in neural information processing systems, pp. 6306–6315 (2017)

  26. Peng, W., Varanka, T., et al.: Hyperbolic deep neural networks: a survey. CoRR arXiv:2101.04562 (2021)

  27. Radford, A., Wu, J., Child, R., et al.: Language models are unsupervised multitask learners. CoRR (2019)

  28. Razavi, A., van den Oord, A., et al.: Generating diverse high-fidelity images with VQ-VAE-2. In: Advances in neural information processing systems, pp. 14837–14847 (2019)

  29. Roy, A., Grangier, D.: Unsupervised paraphrasing without translation. arXiv preprint arXiv:1905.12752 (2019)

  30. Setiadi, D.R.I.M.: PSNR vs SSIM: imperceptibility quality assessment for image steganography. Multimed. Tools Appl. 80(6), 8423–8444 (2021)

  31. Skilling, J.: Programming the Hilbert curve. In: Bayesian Inference and Maximum Entropy Methods in Science and Engineering, pp. 381–387 (2004). https://doi.org/10.1063/1.1751381

  32. Tsinganos, P., Cornelis, B., et al.: A Hilbert curve based representation of sEMG signals for gesture recognition. In: International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 201–206 (2019)

  33. Wang, L., Hu, F., Wu, S., et al.: Fully hyperbolic graph convolution network for recommendation. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3483–3487 (2021)

  34. Wang, Z., Bovik, A.C., Sheikh, H.R., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

  35. Wu, C., Liang, J., et al.: NÜWA: visual synthesis pre-training for neural visual world creation. CoRR arXiv:2111.12417 (2021)

  36. Wu, X., Guo, R., Suresh, A.T., et al.: Multiscale quantization for fast similarity search. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5745–5755 (2017)

  37. Wu, Y., Cao, X., An, Z.: A spatiotemporal trajectory data index based on the Hilbert curve code. In: IOP Conference Series: Earth and Environmental Science, vol. 502, no. 1, p. 012005 (2020)

  38. Yang, M., Zhou, M., Liu, J., et al.: HRCF: enhancing collaborative filtering via hyperbolic geometric regularization. In: Proceedings of the ACM Web Conference, pp. 2462–2471 (2022)

  39. Zhang, H., Yin, W., et al.: ERNIE-ViLG: unified generative pre-training for bidirectional vision-language generation. CoRR arXiv:2112.15283 (2021)

  40. Zhang, R., Isola, P., Efros, A.A., et al.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR. Computer Vision Foundation/IEEE Computer Society, pp. 586–595 (2018)


Acknowledgements

This work has been supported by the National Natural Science Foundation of China under Grant Nos. U1911203, 61877018, 61977025, and 62202170, and by Alibaba Group through the Alibaba Innovation Research Program.

Funding

National Natural Science Foundation of China (Grant Nos. U1911203, 61877018, 61977025, and 62202170); Alibaba Innovation Research Program.

Author information

Authors and Affiliations

Authors

Contributions

LL and TL proposed the main ideas of this paper and conducted all experiments. LL, TL, CW, and MQ wrote the main manuscript text. CC, MG, and AZ provided computing resources for the experiments and reviewed the manuscript.

Corresponding author

Correspondence to Cen Chen.

Ethics declarations

Conflict of interest

All authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethical approval and consent to participate

Not applicable.

Consent for publication

We declare that this manuscript has not been published previously and is not under consideration by any other conference or journal. All authors have approved the manuscript for publication.

Human and animal ethics

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Li, L., Liu, T., Wang, C. et al. Resizing codebook of vector quantization without retraining. Multimedia Systems 29, 1499–1512 (2023). https://doi.org/10.1007/s00530-023-01065-2
