Resizing codebook of vector quantization without retraining

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Large models pre-trained on massive data have become a flourishing paradigm of artificial intelligence systems. Recent works, such as M6, CogView, WenLan 2.0, NÜWA, and ERNIE-ViLG, further extend this paradigm to joint Vision-Language Pre-training (VLP). For VLP, the two-stage architecture is a popular design: the first stage learns an encoding function for the data, and the second stage learns a probabilistic model over the encoded representations. Vector quantization (VQ) is commonly used as the encoding function for image data in the first stage. VQ comprises a data structure (the codebook) and an algorithm (nearest-neighbor quantization). Publicly available VQ models (e.g., VQGAN, VQVAE, VQVAE2) ship with a codebook whose size is assigned empirically (e.g., 1024, 4096, or 16,384) by their authors. If we want a smaller codebook to lower the computation load of the VQ process, or a larger codebook for better reconstruction quality, we have to retrain the VQ models, which consist of a down-sampling net, the codebook, and an up-sampling net. However, retraining VQ models is very expensive, since these models, with billions of parameters, are trained on massive datasets. This motivates us to find an approach to resize the codebook of vector quantization without retraining. In this paper, we leverage hyperbolic embeddings to enrich codebook vectors with co-occurrence information and reorder the enhanced codebook by the Hilbert curve. We can then resize the codebook of vector quantization for lower computation load or better reconstruction quality. Experimental results demonstrate the efficiency and effectiveness of our approach compared with competitive baselines. The code will be released to the public.
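To make the abstract's terminology concrete, below is a minimal NumPy sketch (our illustration, not the authors' released code) of the two operations it refers to: nearest-neighbor quantization of encoder outputs against a codebook, and shrinking a codebook that has already been reordered along a locality-preserving index. The function names, the evenly spaced subsampling rule, and the abstracted `order` permutation (which in the paper would come from a Hilbert-curve index over the hyperbolically enhanced codebook) are illustrative assumptions.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbor vector quantization.

    z:        (n, d) encoder outputs (e.g., feature vectors of image patches)
    codebook: (K, d) codebook vectors
    returns:  (n,) code indices and (n, d) quantized vectors
    """
    # Squared Euclidean distances via ||z||^2 - 2 z.c + ||c||^2, shape (n, K).
    d2 = (
        (z ** 2).sum(axis=1, keepdims=True)
        - 2.0 * z @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def resize_codebook(codebook, order, new_size):
    """Shrink a codebook without retraining (illustrative sketch).

    `order` is a permutation of range(K) that sorts entries along a
    locality-preserving index (in the paper, a Hilbert-curve index of the
    hyperbolically enhanced codebook), so that neighboring entries are
    similar. Keeping entries spaced evenly along that index is one simple
    way to retain coverage at a smaller size.
    """
    reordered = codebook[order]
    keep = np.linspace(0, len(codebook) - 1, new_size).round().astype(int)
    return reordered[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(1024, 256))  # a VQGAN-scale codebook
    z = rng.normal(size=(16, 256))           # encoder outputs for 16 patches
    order = np.arange(1024)                  # placeholder for a Hilbert ordering
    small = resize_codebook(codebook, order, 256)
    idx, z_q = quantize(z, small)
    print(idx.shape, z_q.shape)              # (16,) (16, 256)
```

Growing the codebook would analogously insert new entries between curve neighbors (e.g., by interpolation); the paper's exact rule may differ, so this sketch covers only the shrinking direction.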


Availability of supporting data

All datasets and model weights used in our experiments are publicly available, with references or URLs provided. Our proposed approaches are described in detail with pseudocode or figures. In addition, we will release all code as an open-source project upon acceptance of this paper.

Notes

  1. https://github.com/CompVis/taming-transformers/blob/master/taming/modules/vqvae/quantize.py, https://github.com/MishaLaskin/vqvae/blob/master/models/quantizer.py.

  2. https://github.com/CompVis/taming-transformers.

  3. https://openi.pcl.ac.cn/BAAI/WuDao-Model/src/branch/master/CogView.

References

  1. Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: dataset and study. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 1122–1131 (2017)

  2. Ai, L., Yu, J., Wu, Z., et al.: Optimized residual vector quantization for efficient approximate nearest neighbor search. Multimed. Syst. 23(2), 169–181 (2017)

  3. Bai, Y., Ying, Z., Ren, H., et al.: Modeling heterogeneous hierarchies with relation-specific hyperbolic cones. In: Advances in Neural Information Processing Systems, pp. 12316–12327 (2021)

  4. Chen, H., Chang, Y.: All-nearest-neighbors finding based on the Hilbert curve. Expert Syst. Appl. 38(6), 7462–7475 (2011)

  5. Devlin, J., Chang, M., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019)

  6. Ding, M., Kong, K., Li, J., et al.: VQ-GNN: a universal framework to scale up graph neural networks using vector quantization. arXiv preprint arXiv:2110.14363 (2021)

  7. Ding, M., Yang, Z., et al.: CogView: mastering text-to-image generation via transformers. CoRR arXiv:2105.13290 (2021)

  8. Esser, P., Rombach, R., et al.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883 (2021)

  9. Fei, N., Lu, Z., et al.: WenLan 2.0: make AI imagine via a multimodal foundation model. CoRR arXiv:2110.14378 (2021)

  10. Fu, C., Xiang, C., Wang, C., et al.: Fast approximate nearest neighbor search with the navigating spreading-out graph. Proc. VLDB Endow. 12(5), 461–474 (2019)

  11. Ge, T., He, K., Ke, Q., et al.: Optimized product quantization for approximate nearest neighbor search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953 (2013)

  12. Guo, R., Sun, P., Lindgren, E., et al.: Accelerating large-scale inference with anisotropic vector quantization. In: International Conference on Machine Learning, vol. 119, PMLR, pp. 3887–3896 (2020)

  13. He, K., Wen, F., Sun, J.: K-means hashing: an affinity-preserving quantization method for learning binary compact codes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2938–2945 (2013)

  14. Heusel, M., Ramsauer, H., Unterthiner, T., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)

  15. Ignatov, A., Timofte, R., et al.: PIRM challenge on perceptual image enhancement on smartphones: report. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 315–333 (2018)

  16. Kalantidis, Y., Avrithis, Y.: Locally optimized product quantization for approximate nearest neighbor search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329–2336 (2014)

  17. Karras, T., Aila, T., et al.: Progressive growing of GANs for improved quality, stability, and variation. In: ICLR (2018)

  18. Khrulkov, V., Mirvakhabova, L., et al.: Hyperbolic image embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6417–6427 (2020)

  19. Kitaev, N., Kaiser, L., et al.: Reformer: the efficient transformer. In: ICLR (2020)

  20. Lin, J., Men, R., et al.: M6: a Chinese multimodal pretrainer. CoRR arXiv:2103.00823 (2021)

  21. Lin, T., Maire, M., et al.: Microsoft COCO: common objects in context. In: Proceedings of the 13th European Conference on Computer Vision (ECCV), pp. 740–755 (2014)

  22. Liu, Z., Luo, P., et al.: Deep learning face attributes in the wild. In: Proceedings of the IEEE international conference on computer vision, pp. 3730–3738 (2015)

  23. Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 824–836 (2020)

  24. Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. In: Advances in neural information processing systems, pp. 6338–6347 (2017)

  25. van den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: Advances in neural information processing systems, pp. 6306–6315 (2017)

  26. Peng, W., Varanka, T., et al.: Hyperbolic deep neural networks: a survey. CoRR arXiv:2101.04562 (2021)

  27. Radford, A., Wu, J., Child, R., et al.: Language models are unsupervised multitask learners. CoRR (2019)

  28. Razavi, A., van den Oord, A., et al.: Generating diverse high-fidelity images with VQ-VAE-2. In: Advances in neural information processing systems, pp. 14837–14847 (2019)

  29. Roy, A., Grangier, D.: Unsupervised paraphrasing without translation. arXiv preprint arXiv:1905.12752 (2019)

  30. Setiadi, D.R.I.M.: PSNR vs SSIM: imperceptibility quality assessment for image steganography. Multimed. Tools Appl. 80(6), 8423–8444 (2021)

  31. Skilling, J.: Programming the Hilbert curve. In: Bayesian Inference and Maximum Entropy Methods in Science and Engineering, pp. 381–387 (2004). https://doi.org/10.1063/1.1751381

  32. Tsinganos, P., Cornelis, B., et al.: A Hilbert curve based representation of sEMG signals for gesture recognition. In: International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 201–206 (2019)

  33. Wang, L., Hu, F., Wu, S., et al.: Fully hyperbolic graph convolution network for recommendation. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3483–3487 (2021)

  34. Wang, Z., Bovik, A.C., Sheikh, H.R., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

  35. Wu, C., Liang, J., et al.: NÜWA: visual synthesis pre-training for neural visual world creation. CoRR arXiv:2111.12417 (2021)

  36. Wu, X., Guo, R., Suresh, A.T., et al.: Multiscale quantization for fast similarity search. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5745–5755 (2017)

  37. Wu, Y., Cao, X., An, Z.: A spatiotemporal trajectory data index based on the Hilbert curve code. In: IOP Conference Series: Earth and Environmental Science, vol. 502, no. 1, p. 012005 (2020)

  38. Yang, M., Zhou, M., Liu, J., et al.: HRCF: enhancing collaborative filtering via hyperbolic geometric regularization. In: Proceedings of the ACM Web Conference, pp. 2462–2471 (2022)

  39. Zhang, H., Yin, W., et al.: ERNIE-ViLG: unified generative pre-training for bidirectional vision-language generation. CoRR arXiv:2112.15283 (2021)

  40. Zhang, R., Isola, P., Efros, A.A., et al.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR. Computer Vision Foundation/IEEE Computer Society, pp. 586–595 (2018)


Acknowledgements

This work has been supported by the National Natural Science Foundation of China under Grant Nos. U1911203, 61877018, 61977025, and 62202170, and by Alibaba Group through the Alibaba Innovation Research Program.

Funding

National Natural Science Foundation of China (Grant Nos. U1911203, 61877018, 61977025, and 62202170); Alibaba Innovation Research Program.

Author information

Authors and Affiliations

Authors

Contributions

LL and TL proposed the main ideas of this paper and conducted all experiments. LL, TL, CW, and MQ wrote the main manuscript text. CC, MG, and AZ provided computing resources for the experiments and reviewed the manuscript.

Corresponding author

Correspondence to Cen Chen.

Ethics declarations

Conflict of interest

All authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethical approval and consent to participate

Not applicable.

Consent for publication

We declare that this manuscript has not been published previously and is not under consideration by any other conference or journal. All authors have approved the manuscript for publication.

Human and animal ethics

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Li, L., Liu, T., Wang, C. et al. Resizing codebook of vector quantization without retraining. Multimedia Systems 29, 1499–1512 (2023). https://doi.org/10.1007/s00530-023-01065-2
