Abstract
Generating creative combinatorial objects from two seemingly unrelated object texts is a challenging task in text-to-image (T2I) synthesis, often hindered by the tendency of generative models to merely emulate existing data distributions. In this paper, we develop a straightforward yet highly effective method called balance swap-sampling. First, we propose a swapping mechanism that generates a novel set of combinatorial object images by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing CLIP distances between the new images and their original generations, increasing the likelihood of accepting high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art T2I methods. Surprisingly, our results even rival those of human artists, such as the frog-broccoli example in Fig. 1.
J. Li and Z. Zhang contributed equally.
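To make the swap-and-balance pipeline concrete, here is a minimal Python sketch, assuming the Hugging Face diffusers Stable Diffusion pipeline and an OpenAI CLIP checkpoint. The model names, the token-level swap, the 50% swap ratio, and the acceptance band are illustrative assumptions rather than the paper's exact settings, and the final segmentation-based selection (e.g. with SAM) is only indicated by a comment.

```python
# Minimal sketch of balance swap-sampling; NOT the authors' released code.
# Assumptions: Stable Diffusion v1.5 via diffusers, CLIP ViT-B/32 via
# transformers; swap ratio and acceptance band are illustrative choices.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def encode_prompt(text):
    """Per-token CLIP text embeddings used to condition the diffusion model."""
    tokens = pipe.tokenizer(
        text, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).to(device)
    return pipe.text_encoder(tokens.input_ids)[0]  # shape (1, 77, 768)


def swap_embeddings(emb_a, emb_b, ratio=0.5):
    """Randomly exchange a fraction of token positions between two embeddings."""
    mixed = emb_a.clone()
    mask = torch.rand(emb_a.shape[1], device=emb_a.device) < ratio
    mixed[:, mask] = emb_b[:, mask]
    return mixed


def clip_distance(img_x, img_y):
    """1 - cosine similarity between CLIP image features."""
    inputs = proc(images=[img_x, img_y], return_tensors="pt").to(device)
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return 1.0 - (feats[0] @ feats[1]).item()


def balance_swap_sample(text_a, text_b, n_candidates=8, band=(0.35, 0.65)):
    """Generate swapped candidates and keep those roughly equidistant, in CLIP
    space, from the two original generations; most balanced candidates first."""
    emb_a, emb_b = encode_prompt(text_a), encode_prompt(text_b)
    img_a = pipe(prompt_embeds=emb_a).images[0]  # original generation for text_a
    img_b = pipe(prompt_embeds=emb_b).images[0]  # original generation for text_b
    kept = []
    for _ in range(n_candidates):
        img = pipe(prompt_embeds=swap_embeddings(emb_a, emb_b)).images[0]
        d_a, d_b = clip_distance(img, img_a), clip_distance(img, img_b)
        balance = d_a / (d_a + d_b + 1e-8)  # 0.5 = equally far from both parents
        if band[0] <= balance <= band[1]:
            kept.append((abs(balance - 0.5), img))
    # The full method additionally compares CLIP distances among segmented
    # components (e.g. SAM masks) to pick the final object from this subset.
    return [img for _, img in sorted(kept, key=lambda t: t[0])]


# Example: candidates = balance_swap_sample("a photo of a frog", "a photo of broccoli")
```

Which embedding components are exchanged and how the acceptance region is bounded are precisely the design choices the paper studies; this sketch only fixes one plausible instantiation of each.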
Acknowledgements
This work was partially supported by the National Science Fund of China, Grant Nos. 62072242 and 62361166670. We sincerely thank the French artist Les Creatonautes for granting us permission to use their images.
Cite this paper
Li, J., Zhang, Z., Yang, J. (2025). TP2O: Creative Text Pair-to-Object Generation Using Balance Swap-Sampling. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15130. Springer, Cham. https://doi.org/10.1007/978-3-031-73220-1_6