DOI: 10.1145/3664647.3680621

Customizing Text-to-Image Generation with Inverted Interaction

Published: 28 October 2024

Abstract

Subject-driven image generation, which aims to customize user-specified subjects, has experienced rapid progress. However, most existing methods focus on transferring the customized appearance of subjects. In this work, we consider a novel concept customization task: capturing the interaction between subjects in exemplar images and transferring the learned concept of interaction to achieve customized text-to-image generation. Intrinsically, the interaction between subjects is diverse and difficult to describe in only a few words. In addition, typical exemplar images depict interactions between humans, which further intensifies the challenge of interaction-driven image generation with various categories of subjects. To address this task, we adopt a divide-and-conquer strategy and propose a two-stage interaction inversion framework. The framework first learns a pseudo-word for the single pose of each subject in the interaction; these pseudo-words are then employed to promote the learning of the interaction concept. In addition, a language prior and a cross-attention loss are incorporated into the optimization process to encourage the modeling of the interaction. Extensive experiments demonstrate that the proposed method can effectively invert the interactive pose from exemplar images and apply it to customized generation with user-specified interactions.
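The abstract outlines the optimization scheme only at a high level; the following is a minimal, hypothetical sketch of what such a two-stage inversion loop could look like, assuming a Stable-Diffusion-style latent diffusion backbone. Every name below (invert_embeddings, noise_scheduler, subject_mask, lambda_attn, etc.) is an illustrative placeholder rather than the authors' code, and the exact forms of the cross-attention loss and language prior are assumptions based on the abstract.

```python
import torch
import torch.nn.functional as F


def invert_embeddings(unet, vae, text_encoder, noise_scheduler, data_loader,
                      learnable_embeddings, lambda_attn=0.0, steps=1000, lr=5e-4):
    """Optimize pseudo-word embeddings with the usual latent-diffusion denoising
    loss, plus an optional cross-attention term (assumed form) that encourages
    the new token to attend to the region of the interacting subjects."""
    optimizer = torch.optim.AdamW(learnable_embeddings, lr=lr)
    for _, (image, prompt_ids, subject_mask) in zip(range(steps), data_loader):
        latents = vae.encode(image)                       # image -> latent code
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.num_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy_latents = noise_scheduler.add_noise(latents, noise, t)
        cond = text_encoder(prompt_ids)                   # prompt contains the pseudo-word
        noise_pred, attn = unet(noisy_latents, t, cond)   # attn: token's cross-attention map
        loss = F.mse_loss(noise_pred, noise)              # standard reconstruction objective
        if lambda_attn > 0:                               # cross-attention loss (assumption)
            loss = loss + lambda_attn * F.mse_loss(attn, subject_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return learnable_embeddings


# Stage 1: learn one pseudo-word per subject from its single pose
#          (lambda_attn = 0, plain textual inversion).
# Stage 2: learn the interaction token, initialized with the stage-1 tokens and a
#          language prior (e.g., the embedding of a verb naming the interaction),
#          with the cross-attention loss enabled (lambda_attn > 0).
```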

Information

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 979-8-4007-0686-8
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. customized text-to-image generation
2. diffusion model
3. textual inversion

Qualifiers

• Research-article

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
