Abstract
We explore an innovative region-based visual token representation and present the REgion-native AutoencoDER (Reader). In contrast to most previous methods, which represent each image as a grid-shaped map of tokens, Reader perceives each image as a sequence of region-based tokens, with each token corresponding to an object or a part of an object in the image. Specifically, Reader comprises an encoder and a decoder. The encoder partitions each image into an adaptive number of arbitrary-shaped regions and encodes each region into a token. The decoder then uses this adaptive-length token sequence to reconstruct the original image. Experimental results demonstrate that this region-based token representation has two notable characteristics. First, it encodes images highly efficiently: Reader adaptively uses more regions for complex areas and fewer for simpler ones, avoiding information redundancy. Consequently, it achieves superior reconstruction fidelity compared to previous methods, despite using significantly fewer tokens per image. Second, the region-based representation enables manipulation of a local region without causing global changes. As a result, Reader inherently supports diverse image editing operations, including erasing, adding, and replacing objects as well as modifying their shapes, and achieves strong performance on the smile-transfer image editing benchmark. Code is provided at https://github.com/MengyuWang826/Reader.git.
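To make the idea of adaptive-length region tokens concrete, the following is a minimal conceptual sketch, not the authors' implementation: given a backbone feature map and a region partition (here supplied externally, whereas Reader learns the partition), one token is pooled per arbitrary-shaped region, so the sequence length varies with the number of regions. The function name region_tokenize and all shapes are hypothetical.

# Conceptual sketch only (hypothetical names); illustrates region-based tokens,
# not the Reader encoder itself.
import torch

def region_tokenize(feat: torch.Tensor, region_map: torch.Tensor) -> torch.Tensor:
    """
    feat:       (C, H, W) feature map from any backbone.
    region_map: (H, W) integer map assigning each pixel to a region id; the number
                of distinct ids varies per image, so the output length is adaptive.
    returns:    (num_regions, C) tensor with one token per region.
    """
    tokens = []
    for rid in region_map.unique():
        mask = (region_map == rid).float()                 # (H, W) binary region mask
        pooled = (feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1.0)
        tokens.append(pooled)                              # (C,) token for this region
    return torch.stack(tokens)

# Toy usage: an 8-channel 32x32 feature map partitioned into 3 regions yields
# 3 tokens instead of the 1024 tokens of a dense 32x32 grid.
feat = torch.randn(8, 32, 32)
region_map = torch.randint(0, 3, (32, 32))
print(region_tokenize(feat, region_map).shape)  # torch.Size([3, 8])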
M. Wang and Y. Huang—Internship at BAAI.
Y. Huang—Internship at Skywork AI.
Acknowledgements
This research was funded by the Fundamental Research Funds for the Central Universities (2024XKRC082) and the National Natural Science Foundation of China (No. U23A20314).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, M. et al. (2025). Region-Native Visual Tokenization. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_2