Abstract
Text-to-image generation is a fundamental and inherently challenging task for visual-linguistic modeling. The recent surge of work in this area, such as DALL\(\cdot \)E, has shown breathtaking technical breakthroughs; however, it still lacks precise control over the spatial relations implied by the semantic text. To tackle this problem, mouse traces paired with text provide an interactive alternative, in which users describe the imagined image in natural language while drawing traces to locate the things they mention. However, this raises challenges for both the controllability and the compositionality of generation. Motivated by this, we propose a Trace Controlled Text to Image Generation model (TCTIG), which takes the trace as a bridge between semantic concepts and spatial conditions. Moreover, we propose a set of new techniques to enhance the controllability and compositionality of generation, including a trace guided re-weighting loss (TGR) and semantic aligned augmentation (SAA). In addition, we establish a solid benchmark for the trace-controlled text-to-image generation task and introduce several new metrics to evaluate both the controllability and compositionality of the model. Upon that, we demonstrate TCTIG's superior performance and present a rich qualitative analysis of our model.
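To make the idea of the trace guided re-weighting loss (TGR) more concrete, the sketch below shows one plausible way such a loss could be implemented; it is only an illustrative assumption, since the abstract does not give the exact formulation. We assume the image is modeled as a grid of discrete visual tokens (as in VQGAN-based text-to-image models) and that tokens whose grid cells are covered by the user's trace are up-weighted in the cross-entropy objective; the function name, the mask construction, and the weight alpha are all hypothetical.

import torch
import torch.nn.functional as F

def trace_guided_reweighted_loss(logits, target_tokens, trace_mask, alpha=2.0):
    # Hypothetical TGR-style loss, not the paper's exact formulation.
    # logits:        (B, N, V) predicted distributions over the visual codebook
    # target_tokens: (B, N)    ground-truth discrete image tokens
    # trace_mask:    (B, N)    1.0 where a mouse trace covers the token's grid cell, else 0.0
    # alpha:         extra weight given to trace-covered tokens (assumed hyperparameter)
    B, N, V = logits.shape
    # Per-token cross-entropy, kept unreduced so it can be re-weighted per position.
    per_token = F.cross_entropy(
        logits.reshape(B * N, V), target_tokens.reshape(B * N), reduction="none"
    ).reshape(B, N)
    # Up-weight tokens pointed at by the trace; leave the remaining tokens at weight 1.
    weights = 1.0 + alpha * trace_mask
    return (weights * per_token).sum() / weights.sum()

In this reading, TGR simply biases the autoregressive objective toward the image regions the user explicitly pointed at, which is one way a trace could act as a bridge between semantic concepts and spatial conditions.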
Notes
1. The checkpoint and model config can be found at https://heibox.uni-heidelberg.de/d/8088892a516d4e3baf92/.
Acknowledgement
This work is supported in part by NSFC 61925203.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yan, K. et al. (2022). Trace Controlled Text to Image Generation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20058-8
Online ISBN: 978-3-031-20059-5