Abstract
Whilst diffusion probabilistic models can generate high-quality image content, key limitations remain in generating high-resolution imagery and in the associated high computational requirements. Recent Vector-Quantized image models have overcome this resolution limitation but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior. By contrast, in this paper we propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone. During training, tokens are randomly masked in an order-agnostic manner and the Transformer learns to predict the original tokens. This parallelism of Vector-Quantized token prediction in turn facilitates unconditional generation of globally consistent, high-resolution and diverse imagery at a fraction of the computational expense of autoregressive approaches. In this manner, we can generate images at resolutions exceeding those of the original training samples whilst additionally providing per-image likelihood estimates (in a departure from generative adversarial approaches). Our approach achieves state-of-the-art results in terms of the manifold overlap metrics Coverage (LSUN Bedroom: 0.83; LSUN Churches: 0.73; FFHQ: 0.80) and Density (LSUN Bedroom: 1.51; LSUN Churches: 1.12; FFHQ: 1.20), and performs competitively on FID (LSUN Bedroom: 3.27; LSUN Churches: 4.07; FFHQ: 6.11) whilst offering advantages in terms of both computation and reduced training set requirements.
S. Bond-Taylor and P. Hessey contributed equally. Source code for this work is available at https://github.com/samb-t/unleashing-transformers.
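To make the approach described in the abstract concrete, below is a minimal PyTorch sketch of a discrete absorbing diffusion prior over Vector-Quantized token indices: order-agnostic masked training, and the parallel sampling loop that replaces element-wise autoregressive decoding. This is an illustrative sketch under stated assumptions, not the authors' released implementation (see the repository linked above); the `denoiser` interface, the reserved mask index `vocab_size`, and the per-token masking and unveiling probabilities are assumptions made for exposition.

```python
import torch
import torch.nn.functional as F

def train_step(denoiser, x0, vocab_size, T):
    """Order-agnostic masked training on a batch of VQ token indices x0: (B, L).

    `denoiser` is assumed to be a bidirectional Transformer mapping a corrupted
    sequence and a timestep to logits of shape (B, L, vocab_size), with the
    extra index `vocab_size` reserved as the absorbing [MASK] token.
    """
    B, L = x0.shape
    mask_id = vocab_size
    t = torch.randint(1, T + 1, (B,), device=x0.device)   # diffusion time per sample
    # Mask each token independently with probability t/T: few tokens masked
    # at early times, (almost) all tokens masked at time T.
    mask = torch.rand(B, L, device=x0.device) < (t.float() / T).unsqueeze(1)
    x_t = x0.masked_fill(mask, mask_id)                   # corrupted sequence
    logits = denoiser(x_t, t)                             # (B, L, vocab_size)
    # Cross-entropy on the masked positions only, with an ELBO-style 1/t
    # reweighting as in discrete absorbing diffusion models.
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
    ce = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (ce / t).mean()

@torch.no_grad()
def sample(denoiser, length, vocab_size, T, batch=1, device="cpu"):
    """Parallel decoding: start fully masked, unveil a random subset each step."""
    mask_id = vocab_size
    x = torch.full((batch, length), mask_id, dtype=torch.long, device=device)
    for t in range(T, 0, -1):
        t_vec = torch.full((batch,), t, dtype=torch.long, device=device)
        probs = denoiser(x, t_vec).softmax(dim=-1)
        # Sample candidate tokens for every position simultaneously...
        cand = torch.multinomial(probs.view(-1, vocab_size), 1).view(batch, length)
        # ...but commit only a 1/t fraction of the still-masked positions, so
        # generation costs T network evaluations rather than `length`.
        still_masked = x == mask_id
        reveal = still_masked & (torch.rand(batch, length, device=device) < 1.0 / t)
        x[reveal] = cand[reveal]
    return x
```

In this formulation the number of diffusion steps T controls a speed/quality trade-off: each reverse step predicts many tokens in parallel, so choosing T smaller than the sequence length is what yields the computational savings over element-wise autoregressive priors.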
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bond-Taylor, S., Hessey, P., Sasaki, H., Breckon, T.P., Willcocks, C.G. (2022). Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13683. Springer, Cham. https://doi.org/10.1007/978-3-031-20050-2_11