
TP2O: Creative Text Pair-to-Object Generation Using Balance Swap-Sampling

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15130)


Abstract

Generating creative combinatorial objects from two seemingly unrelated object texts is a challenging task in text-to-image synthesis, often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method called balance swap-sampling. First, we propose a swapping mechanism that generates a novel set of combinatorial object images by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing CLIP distances between the new images and their original generations, increasing the likelihood of accepting high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art text-to-image (T2I) methods. Surprisingly, our results even rival those of human artists, such as the frog-broccoli in Fig. 1.

J. Li and Z. Zhang—Equal contribution.
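
The abstract outlines a three-stage pipeline. As a rough illustration of the first two stages, the sketch below implements random embedding swapping and a balance-region acceptance test. The callables `encode_text`, `generate_image`, and `clip_distance`, the flat-vector treatment of embeddings, and the tolerance `tol` are hypothetical placeholders for illustration only, not the paper's actual implementation.

```python
# Minimal sketch of balance swap-sampling, assuming user-supplied callables:
#   encode_text(text) -> 1D numpy embedding (placeholder for a diffusion text encoder)
#   generate_image(embedding) -> image (placeholder for a diffusion sampler)
#   clip_distance(img_x, img_y) -> float (placeholder for a CLIP-based distance)
import numpy as np

def swap_embeddings(emb_a, emb_b, swap_ratio=0.5, rng=None):
    """Randomly exchange a subset of dimensions between two text embeddings."""
    rng = rng or np.random.default_rng()
    mixed = np.array(emb_a, copy=True)
    idx = rng.choice(mixed.size, size=int(swap_ratio * mixed.size), replace=False)
    mixed[idx] = np.asarray(emb_b)[idx]
    return mixed

def in_balance_region(img, img_a, img_b, clip_distance, tol=0.15):
    """Accept a candidate only if its CLIP distances to the two original
    generations are roughly balanced (neither parent concept dominates)."""
    d_a = clip_distance(img, img_a)
    d_b = clip_distance(img, img_b)
    return abs(d_a - d_b) <= tol

def balance_swap_sampling(text_a, text_b, encode_text, generate_image,
                          clip_distance, n_samples=32, rng=None):
    """Generate candidate combinations by swap-sampling the two text embeddings
    and keep only those falling inside the balance swapping region."""
    emb_a, emb_b = encode_text(text_a), encode_text(text_b)
    img_a, img_b = generate_image(emb_a), generate_image(emb_b)
    accepted = []
    for _ in range(n_samples):
        mixed = swap_embeddings(emb_a, emb_b, rng=rng)
        candidate = generate_image(mixed)
        if in_balance_region(candidate, img_a, img_b, clip_distance):
            accepted.append(candidate)
    # The paper's final stage then compares CLIP distances among segmented
    # components of the accepted images to pick the most promising object.
    return accepted
```
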



Acknowledgements

This work was partially supported by the National Science Fund of China, Grant Nos. 62072242 and 62361166670. We sincerely thank the French artist Les Creatonautes for granting us permission to use their images.

Author information

Corresponding author: Jun Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 4933 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, J., Zhang, Z., Yang, J. (2025). TP2O: Creative Text Pair-to-Object Generation Using Balance Swap-Sampling. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15130. Springer, Cham. https://doi.org/10.1007/978-3-031-73220-1_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73220-1_6


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73219-5

  • Online ISBN: 978-3-031-73220-1

  • eBook Packages: Computer Science, Computer Science (R0)
