Beyond Text-to-Image: Multimodal Prompts to Explore Generative AI

Extended Abstract, CHI EA '23. DOI: 10.1145/3544549.3577043. Published: 19 April 2023.

Abstract

Text-to-image AI systems have proven to have extraordinary generative capacities that have facilitated widespread adoption. However, these systems are primarily text-based, which is a fundamental inversion of what many artists are traditionally used to: having full control over the composition of their work. Prior work has shown that there is great utility in using text prompts and that AI-augmented workflows can increase momentum on creative tasks for end users. However, multimodal interactions beyond text need to be further defined so that end users can have rich points of interaction that allow them to truly co-pilot AI-generated content creation. To this end, the goal of my research is to equip creators with workflows that 1) translate abstract design goals into prompts of visual language, 2) structure exploration of design outcomes, and 3) integrate creator contributions into generations.

Figure 1: Creative workflows augmented with prompt-based AI should 1) translate abstract design goals into prompts of visual language, 2) structure exploration of design outcomes, and 3) integrate creator contributions into generations.

1 INTRODUCTION

Due to advancements in deep learning, AI models have shown extraordinary generative capacities in the text [4], image [9, 16], video [29], and 3D domains [23]. Human-computer interaction research has sought to translate these computational advancements into novel abilities for end users. To this end, large foundational AI models such as GPT-3, CLIP, and DALL-E have been applied to complex tasks such as science communication [12], news illustration [18], and 3D modeling [19]. Interactions with these new models rely on prompting, the ability to steer and condition the generative process using natural language. My research has focused on studying prompting as an emerging form of interaction. Prompts are generally text-based and capable of capturing creator goals at the highest level. For example, a designer could prompt for a “3D render of a cyberpunk ballerina” and receive generations without any further specification. However, creators are generally used to having more control over their creative output. They are familiar with direct manipulation at the level of words, pixels, and geometry. Control over AI outputs and the ability to better steer the generation process are among the most fundamental problems of AI co-piloted creativity.

My focus as a researcher is to scope out points of rich interaction such that the processes and workflows of artists and designers can be augmented rather than automated. To this end, I strive to equip creators with workflows that 1) translate abstract goals and intentions into prompts of visual language, 2) structure exploration of design outcomes, and 3) integrate creator contributions into generations. These concepts have been actualized in two systems I have built: Opal, a system for news illustration, and 3DALL-E, a text-to-image system for conceptualizing 3D product designs.

2 RELATED WORK

2.1 Prompting

Prompting is an emergent form of interaction that allows users to engage with generative AI through natural language. Breakthroughs from transformers in terms of scale and multimodal representation learning have increased the ability of models to properly capture long-range dependencies in the text, image, video, and 3D domains. A user can use prompts to flexibly adapt what task they want a large pretrained model to handle [3, 26].

Prompts capture different elements depending on the domain they operate over. For example, text-to-image prompts can capture the visual vocabulary of images, while text prompts capture the syntax, semantics, and style of language. Early work in my PhD generated and annotated 5000 AI images to empirically understand text-to-image prompting capabilities [17, 24]. The quantitative and qualitative analysis synthesized design guidelines for prompt engineering text-to-image generations. These included suggestions to parameterize prompts with richly visual keywords such as concrete subjects and figurative art styles.

While much work has focused on how to arrive at the right prompt, prior work has also looked into how prompts can integrate into workflows. For example, prompting has been used to modularize complex tasks into smaller, more prompt-addressable ones [31]. Prompts have also been folded into node-based editors to give novice end users the ability to visually program prompts [30]. My first system, Opal [18], leveraged both text and text-to-image prompts in a creative workflow for illustration.
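
To make this idea of chaining prompts concrete, the following is a minimal sketch of a two-step prompt chain in the spirit of [31]; it is not the implementation from that work, and the complete() helper is a hypothetical stand-in for whatever large language model completion call is available.

```python
# Minimal sketch of an AI chain: a complex task is decomposed into two smaller,
# prompt-addressable steps, with the output of the first prompt feeding the second.
# `complete(prompt)` is a hypothetical stand-in for an LLM completion call.

def complete(prompt: str) -> str:
    """Hypothetical LLM call; substitute whichever completion API you use."""
    raise NotImplementedError

def illustration_prompt_for(article_text: str) -> str:
    # Step 1: distill the article into a few concrete, visual keywords.
    keywords = complete(
        f"List three concrete, visual keywords that capture this article:\n{article_text}"
    )
    # Step 2: turn those keywords into a single text-to-image prompt.
    return complete(
        f"Write a one-sentence text-to-image prompt combining these keywords: {keywords}"
    )
```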

Prompting has also expanded beyond text-only abilities. Users can now condition their generations on image inputs through novel functionalities such as inpainting and outpainting [9, 22]. Image prompts [24], though less explored than text prompts, have great potential in their ability to bring existing creator work and progress into the final generation. Further functions such as textual inversion, the ability to capture concepts with a few exemplar images, are also helping end users prompt outputs in ways beyond words. 3DALL-E, my second system, is one of the first to integrate and field-test text-to-image generations in a feature-rich software authoring environment using both text and image prompts.
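
To illustrate what image-conditioned prompting looks like in practice, the sketch below runs an inpainting pipeline from the open-source diffusers library. This is an assumption for illustration only: the systems described in this paper used DALL-E rather than this pipeline, and the checkpoint name and file paths are placeholders.

```python
# Illustrative sketch of image-conditioned prompting (inpainting) with diffusers;
# the checkpoint name and file paths are placeholders, and this is not the
# pipeline behind the systems described in this paper.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("creator_sketch.png").convert("RGB")  # existing creator work
mask_image = Image.open("region_to_fill.png").convert("RGB")  # white pixels are regenerated

result = pipe(
    prompt="a watercolor skyline of Manhattan at dusk",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("inpainted.png")
```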

2.2 Generative Models

Generative models are now capable of text-conditioned generation of text [4], images [9, 27], videos [29], and 3D shapes [21]. Their outputs in each domain have reached high fidelity and diversity, owing to extensive multimodal pretraining across different domains [25]. An early multimodal embedding was CLIP, whose text-image understanding was leveraged in many early generative workflows [1, 7, 8]. Many text-to-image tools such as Midjourney, DALL-E, and Stable Diffusion are now mainstream and in production. Different architectures lend themselves to different interactions: users can paint with sketches and broad strokes in Make-A-Scene [10], while another model lets them continually paint outside the edges of an image [22].

However, precisely because these models are capable of generating nearly anything, a significant amount of auditing and AI-ethics work has gone into understanding how models can be pre-emptively stopped from producing extreme and inappropriate content. Auditing work has unearthed social and gender biases [6], and strategies such as red-teaming [5] have been implemented to discover and patch loopholes that can produce nefarious outputs.

2.3 Creativity Support Tools

Text-based interaction has been well-studied in many creative contexts prior to the advancement of prompt-based AI. Advances in AI have now been embedded within systems such as Opal [18], Sparks [12], FashionQ [15], and the writing editors in [28]. Many of these systems have established that AI can be an excellent assistant for ideation and brainstorming, even when the topics are more technical and niche [12]. Theory and principles around human-AI interaction [2, 20] have also been established to help navigate the trade-offs of control and exploration that AI can bring to the creative process.

3 RESEARCH CONTRIBUTIONS

Most of my work has concentrated on text-to-image generative frameworks, which allow users to generate images from text. These frameworks utilize deep learning models that have been pretrained on vast amounts of data, giving users the freedom to experiment with a near-infinite number of visual concepts. However, while we can compose any number of visual concepts in text and language, there is no guarantee that any prompt passed through a text-to-image AI will produce a quality outcome, and what models are capable of generating is opaque to the user.

My first research contribution in this space was empirical work done in [17], which showed that structured prompts such as "{SUBJECT} in the style of {STYLE}" (e.g., "Manhattan in the style of Van Gogh") can produce consistent results across a wide variety of subjects and styles. Furthermore, styles with very salient visual characteristics could help guide users towards high-quality aesthetic outcomes. However, these models still perform variably depending on the style and subject matter. One core usability problem with these frameworks is that understanding which combinations of subjects and styles a model can generate is often a random, trial-and-error process. When generations fail, they often exhibit distorted, uncanny, and unnatural compositions. Because these systems are also stochastic, I next explored the challenge of translating text into images in a systematic, structured, and efficient way in Opal [18].
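
To make the structured-prompt finding concrete, the sketch below enumerates a small subject-by-style grid with the "{SUBJECT} in the style of {STYLE}" template; the specific subjects and styles listed are illustrative examples, not the set annotated in [17].

```python
# Minimal sketch of the structured prompt template "{SUBJECT} in the style of {STYLE}";
# the subjects and styles below are illustrative examples, not the study's annotated set.
from itertools import product

SUBJECTS = ["Manhattan", "a lighthouse", "a bowl of fruit"]
STYLES = ["Van Gogh", "ukiyo-e", "cyberpunk concept art"]

def structured_prompts(subjects, styles):
    """Enumerate every subject-style pairing as a text-to-image prompt."""
    return [f"{subject} in the style of {style}" for subject, style in product(subjects, styles)]

for prompt in structured_prompts(SUBJECTS, STYLES):
    print(prompt)  # e.g., "Manhattan in the style of Van Gogh"
```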

3.1 Opal: Visual Language-Based Exploration

To help news illustrators create illustrations efficiently, in tandem with the breakneck pace of journalism, we developed Opal, a system that guides users through a structured search for visual concepts beginning with article text. Through a co-design process with illustrators at a local paper using VQGAN+CLIP as a design probe [13], we found that illustrators tended to engage with the news material they were given through its subject matter, its tone, and the styles they were able to illustrate.

Figure 2: In Opal, we provided a system that translated news article text into prompts of visual language through a series of stages headlined above. Through GPT-3, we suggested article keywords, tones, and styles and had GPT-3 find icons that could visualize them more concretely. A gallery of results that could be arrived at from the interface is shown at the bottom. The figure is adapted from [18].

At a high level, Opal helped illustrators explore text-to-image generation for news illustration. It organized and streamlined prompt ideation for text-to-image generation by providing pipelines that allowed users to engage with the article through the stages pictured in Fig. 2. At each stage of the pipeline, we employed natural language processing techniques such as semantic search, along with prompt engineering methods that queried GPT-3 as a knowledge base and association network. From prompts contributed to by GPT-3, Opal initiated streams of text-to-image generations from VQGAN+CLIP. As such, it was one of the first systems to show how large language models can structure exploration with text-to-image AI and how two large pretrained AI models can be concatenated in one workflow while still complementing one another.
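
As an illustration of the semantic-search step, the sketch below ranks candidate style keywords by their embedding similarity to article text using the sentence-transformers library; the model name, article snippet, and candidate styles are assumptions for illustration rather than Opal's exact components.

```python
# Illustrative sketch of semantic search over candidate style keywords; the model
# name and candidate styles are assumptions, not Opal's exact components.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

article_text = "City council approves a sweeping plan to expand bike lanes downtown."
candidate_styles = ["flat vector illustration", "watercolor", "film noir", "isometric art"]

article_emb = model.encode(article_text, convert_to_tensor=True)
style_embs = model.encode(candidate_styles, convert_to_tensor=True)

# Rank styles by cosine similarity to the article text.
scores = util.cos_sim(article_emb, style_embs)[0]
ranked = sorted(zip(candidate_styles, scores.tolist()), key=lambda pair: pair[1], reverse=True)
for style, score in ranked:
    print(f"{score:.3f}  {style}")
```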

Opal demonstrated that its pipeline could help participants generate images over two times more efficiently. Participants were also over two times more likely to encounter usable results than participants using a baseline version of our system. Opal further demonstrated that participants could engage with generations not only as ideas or images taken as is; they could also remix and post-process the images to make them their own.

Qualitatively, participants mentioned that Opal allowed them to access styles they otherwise generally would not have thought of. They liked that keywords and prompts structured exploration and allowed them to track their conceptual pivots. Importantly, participants commented that even when they had vague ideas prior to Opal use, they were able to build upon them using the system.

3.2 3DALL-E: Multimodal Prompting

A follow-up system was 3DALL-E, pictured in Figure 4. 3DALL-E extended the idea of joining an LLM and a text-to-image model to the 3D domain. 3D models are challenging to build from scratch: designers have to satisfy a number of objectives that can range from functional and aesthetic goals to feasibility constraints. With the idea that text-to-image AI could provide inspiration to 3D designers, we built 3DALL-E, a system that embedded DALL-E, GPT-3, and CLIP within Fusion 360, a computer-aided design (CAD) software package. This system helped 3D designers conceptualize CAD product designs by crafting multimodal (text and image) prompts and generating text-to-image inspiration. After a designer input a design goal (e.g., "design a garbage truck"), the plugin provided related parts, styles, and designs as prompt suggestions. These prompt suggestions helped users familiarize themselves with relevant design language and 3D keywords. In addition, the system created image prompts from user progress in the 3D viewport, giving users a direct bridge between their 3D design workspace and the text-to-image model. We utilized the model knowledge of CLIP to highlight which multimodal (text+image) prompts could work best.
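
To illustrate how CLIP can score how well candidate text prompts align with a render of in-progress 3D work, the sketch below uses OpenAI's open-source CLIP package. The file path and candidate prompts are placeholders, and this is a simplified stand-in rather than 3DALL-E's actual ranking logic.

```python
# Illustrative sketch of scoring candidate prompts against a viewport render with
# CLIP; the file path and prompts are placeholders, and this simplifies whatever
# ranking 3DALL-E actually performs.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

viewport = preprocess(Image.open("viewport_render.png")).unsqueeze(0).to(device)
prompts = [
    "a 3D render of a garbage truck with a hydraulic lift",
    "an industrial design sketch of a compact garbage truck",
    "a cyberpunk garbage truck concept",
]
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(viewport)
    text_features = model.encode_text(tokens)
    # Cosine similarity between the render and each candidate prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarities = (image_features @ text_features.T).squeeze(0)

for prompt, score in zip(prompts, similarities.tolist()):
    print(f"{score:.3f}  {prompt}")
```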

To evaluate 3DALL-E, we conducted a user study (n=13) demonstrating text-to-image AI use cases in 3D design workflows, along with an analysis of prompting patterns and prompt complexity. We examined the following research questions: what prompting patterns users exhibited, how users reacted when AI model knowledge was highlighted with information scent, and how complex prompts tended to be. We further proposed prompt bibliographies, a concept of human-AI design history to track inspiration from text-to-image AI, depicted in Fig. 3.

3DALL-E field-tested the utility of text-to-image generations across a wide range of technical disciplines, from robotics to mechanical engineering to industrial design. Qualitatively, we found that 3DALL-E could help participants of all disciplines prevent design fixation, generate reference images that they could use for 3D sketching, and produce materials to edit the appearance of their models.

Figure 3: Prompt bibliography capturing prompting activity, conceptual exploration, and 3D modeling progress of a participant using 3DALL-E. Within 3DALL-E, alongside text prompts, 3D designers could pass in multimodal prompts such as initial images (text+image prompts) and existing generations (variations). The figure is adapted from [19].

Figure 4: 3DALL-E’s annotated interface demonstrates how text-to-image generation can help conceptualize 3D designs. Exploration of the text prompt was structured by suggesting different designs/styles, parts, and 3D keywords. Users could also pass in an image prompt for a multimodal text+image prompt.

4 RESEARCH VISION

In early work, I conducted audits of generative models, surveying thousands of outputs and finding that they were capable of vast stylistic range [17, 24]. However, in follow-up qualitative analysis with art experts, we noted that AI generations prompted after existing artistic styles tended to only shallowly reproduce those styles, eliciting stereotypes such as heavy pink palettes for feminism [14]. It is imperative to research ways for creators to mitigate bias within AI-generated art, so that AI-generated art can contribute to creative and cultural conversations rather than diluting them. Furthermore, novel interactions can strive to build AI-assisted art into a medium of its own. If we can engineer more interesting conceptual functionalities, akin to the way textual inversion (the ability to capture and reuse concepts) [11] and outpainting [22] naturally grew out of new model affordances, I believe we can steer end users away from arriving at generations through the improper use of existing artists' styles.

My research to date has progressed from understanding generative models to applying them. One of my next steps is to pursue computational work that expands beyond the image domain. I intend to build a system that facilitates text-to-image generation for video creation, as short-form video has become a predominant format on social media. I believe there are many fascinating research angles for this next project, such as how to guarantee consistency across frames of text-to-image generations and how to enliven text-to-image generations with sound, music, and animation. These additional dimensions present challenging modality-alignment problems to solve.

I additionally intend to expand conversations about the ethics of AI-generated content by proposing features to respect ownership and attribution. In 3DALL-E, we proposed prompt bibliographies, which could be further embedded in AI images as metadata. Citing AI involvement could become an automatically embedded practice and a community norm that could help tag and contextualize AI-generated content.
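
As one possible realization of this idea, a prompt bibliography could be written directly into an image's metadata at save time. The sketch below uses Pillow's PNG text chunks; the field names and JSON layout are assumptions for illustration rather than a format specified in our work.

```python
# Illustrative sketch of embedding a prompt bibliography as PNG metadata with
# Pillow; the field names and JSON layout are assumptions for illustration,
# not a format specified in the paper.
import json
from PIL import Image
from PIL.PngImagePlugin import PngInfo

bibliography = {
    "prompts": [
        "a garbage truck, industrial design sketch",
        "a garbage truck with a hydraulic lift, 3D render",
    ],
    "image_prompts": ["viewport_render.png"],
    "model": "example-text-to-image-model",
    "human_contributions": "3D base geometry, final material edits",
}

image = Image.open("generation.png")
metadata = PngInfo()
metadata.add_text("prompt_bibliography", json.dumps(bibliography))
image.save("generation_with_bibliography.png", pnginfo=metadata)

# Reading the bibliography back later:
reloaded = Image.open("generation_with_bibliography.png")
print(json.loads(reloaded.text["prompt_bibliography"]))
```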

A philosophy within my research is that AI-generated content can only be worth as much as the total human effort put into it. Expertise and effort are what give the artifacts we venerate meaning and value. Thus, no matter how advanced AI generation capabilities become, they will always have limited value if the interactions we define focus single-mindedly on efficiency or black-box us into the simplest interactions. Complementing the traditional methods of direct manipulation with the high-level expressiveness of prompts, at various intermediate representations and points of creator workflows, is the motivating theme of my work. As I integrate language into visual workflows, I intend to open new avenues for digital creativity while preserving the human-centered nature of the creative process.

ACKNOWLEDGMENTS

This research is supported by NSF Grant DGE-1644869.

Supplemental Material

3544549.3577043-video.mp4: Video poster presentation (mp4, 131.9 MB)

References

1. Adverb. 2021. Advadnoun. https://twitter.com/advadnoun
2. Saleema Amershi, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, Eric Horvitz, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, and Paul Bennett. 2019. Guidelines for Human-AI Interaction. 1–13. https://doi.org/10.1145/3290605.3300233
3. Gwern Branwen. 2020. GPT-3 Creative Fiction. https://www.gwern.net/GPT-3
4. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. https://doi.org/10.48550/ARXIV.2005.14165
5. Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner, Ruth Fong, Tegan Maharaj, Pang Wei Koh, Sara Hooker, Jade Leung, Andrew Trask, Emma Bluemke, Jonathan Lebensold, Cullen O’Keefe, Mark Koren, Théo Ryffel, JB Rubinovitz, Tamay Besiroglu, Federica Carugati, Jack Clark, Peter Eckersley, Sarah de Haas, Maritza Johnson, Ben Laurie, Alex Ingerman, Igor Krawczuk, Amanda Askell, Rosario Cammarota, Andrew Lohn, David Krueger, Charlotte Stix, Peter Henderson, Logan Graham, Carina Prunkl, Bianca Martin, Elizabeth Seger, Noa Zilberman, Seán Ó hÉigeartaigh, Frens Kroeger, Girish Sastry, Rebecca Kagan, Adrian Weller, Brian Tse, Elizabeth Barnes, Allan Dafoe, Paul Scharre, Ariel Herbert-Voss, Martijn Rasser, Shagun Sodhani, Carrick Flynn, Thomas Krendl Gilbert, Lisa Dyer, Saif Khan, Yoshua Bengio, and Markus Anderljung. 2020. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. https://doi.org/10.48550/ARXIV.2004.07213
6. Jaemin Cho, Abhay Zala, and Mohit Bansal. 2022. DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers. https://doi.org/10.48550/ARXIV.2202.04053
7. Katherine Crowson. 2021. Rivers Have Wings. https://twitter.com/RiversHaveWings
8. Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. 2022. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. arXiv preprint arXiv:2204.08583 (2022).
9. Boris Dayma, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham, Phúc Le Khac, Luke Melas, and Ritobrata Ghosh. 2021. DALL·E Mini. https://doi.org/10.5281/zenodo.1234
10. Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. https://doi.org/10.48550/ARXIV.2203.13131
11. Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. https://doi.org/10.48550/ARXIV.2208.01618
12. Katy Ilonka Gero, Vivian Liu, and Lydia B. Chilton. 2021. Sparks: Inspiration for Science Writing using Language Models. https://doi.org/10.48550/ARXIV.2110.07640
13. Hilary Hutchinson, Ben Bederson, Allison Druin, Catherine Plaisant, Wendy Mackay, Helen Evans, Heiko Hansen, Stéphane Conversy, Michel Beaudouin-Lafon, Nicolas Roussel, Loïc Lacomme, Björn Eiderbäck, Sinna Lindquist, Yngve Sundblad, and Bo Westerlund. 2003. Technology Probes: Inspiring Design for and with Families. Conference on Human Factors in Computing Systems - Proceedings (February 2003).
14. Aaron Jackson, Vivian Liu, and Lydia Chilton. 2022. Analyzing the Cultural Relevance of AI Generated Art. https://drive.google.com/file/d/1MbFhc1dBbxsRf-jC8jXp5Cd5i3RlpaNI/view
15. Youngseung Jeon, Seungwan Jin, Patrick C. Shih, and Kyungsik Han. 2021. FashionQ: An AI-Driven Creativity Support Tool for Facilitating Ideation in Fashion Design. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3411764.3445093
16. Tero Karras, Samuli Laine, and Timo Aila. 2018. A Style-Based Generator Architecture for Generative Adversarial Networks. https://doi.org/10.48550/ARXIV.1812.04948
17. Vivian Liu and Lydia B. Chilton. 2021. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. arXiv:2109.06977 [cs.HC]
18. Vivian Liu, Han Qiao, and Lydia Chilton. 2022. Opal: Multimodal Image Generation for News Illustration. https://doi.org/10.48550/ARXIV.2204.09007
19. Vivian Liu, Jo Vermeulen, George Fitzmaurice, and Justin Matejka. 2022. 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows. https://doi.org/10.48550/ARXIV.2210.11603
20. Maria Teresa Llano, Mark d’Inverno, Matthew Yee-King, Jon McCormack, Alon Ilsar, Alison Pease, and Simon Colton. 2022. Explainable Computational Creativity. https://doi.org/10.48550/ARXIV.2205.05682
21. Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. 2022. Point-E: A System for Generating 3D Point Clouds from Complex Prompts. https://doi.org/10.48550/ARXIV.2212.08751
22. OpenAI. 2022. DALL·E: Introducing Outpainting. https://openai.com/blog/dall-e-introducing-outpainting/
23. Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. https://doi.org/10.48550/ARXIV.2209.14988
24. Han Qiao, Vivian Liu, and Lydia Chilton. 2022. Initial Images: Using Image Prompts to Improve Subject Representation in Multimodal AI Generated Art. In Creativity and Cognition (Venice, Italy) (C&C ’22). Association for Computing Machinery, New York, NY, USA, 15–28. https://doi.org/10.1145/3527927.3532792
25. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
26. Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. arXiv:2102.07350 [cs.CL]
27. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695.
28. Nikhil Singh, Guillermo Bernal, Daria Savchenko, and Elena L. Glassman. 2022. Where to Hide a Stolen Elephant: Leaps in Creative Writing with Multimodal Machine Intelligence. ACM Trans. Comput.-Hum. Interact. (January 2022). https://doi.org/10.1145/3511599. Just Accepted.
29. Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. 2022. Phenaki: Variable Length Video Generation From Open Domain Textual Description. https://doi.org/10.48550/ARXIV.2210.02399
30. Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022. PromptChainer: Chaining Large Language Model Prompts through Visual Programming. https://doi.org/10.48550/ARXIV.2203.06566
31. Tongshuang Wu, Michael Terry, and Carrie J Cai. 2022. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. https://doi.org/10.1145/3491102.3517582
