ABSTRACT
We demonstrate Visual Captions, a real-time system that integrates with a video conferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest visuals relevant to the context of the ongoing conversation. We implemented Visual Captions as a user-customizable Chrome plugin with three levels of AI proactivity: Auto-display (AI autonomously adds visuals), Auto-suggest (AI proactively recommends visuals), and On-demand-suggest (AI suggests visuals when prompted). We showcase the use of Visual Captions in open-vocabulary settings, and show how adding visuals based on the conversational context can improve comprehension of complex or unfamiliar concepts. In addition, we demonstrate three ways people can interact with the system, one for each level of AI proactivity. Visual Captions is open-sourced at https://github.com/google/archat.
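The three proactivity levels described above can be sketched as a simple dispatch over an LLM-produced suggestion. This is an illustrative sketch only, not the actual archat API: the names `Proactivity`, `Suggestion`, and `nextAction` are assumptions made for this example.

```typescript
// Illustrative sketch of the three AI-proactivity levels (hypothetical names,
// not taken from the open-source archat codebase).

enum Proactivity {
  AutoDisplay = "auto-display",   // AI autonomously adds visuals
  AutoSuggest = "auto-suggest",   // AI proactively recommends visuals
  OnDemandSuggest = "on-demand",  // AI suggests visuals only when prompted
}

interface Suggestion {
  query: string;   // e.g. a visual intent extracted from the conversation
  source: string;  // e.g. an image-search backend
}

// Decide what to do with a suggestion under each proactivity mode.
function nextAction(
  mode: Proactivity,
  suggestion: Suggestion,
  userRequested: boolean
): "display" | "suggest" | "ignore" {
  switch (mode) {
    case Proactivity.AutoDisplay:
      return "display"; // show the visual without asking
    case Proactivity.AutoSuggest:
      return "suggest"; // surface a recommendation for user approval
    case Proactivity.OnDemandSuggest:
      return userRequested ? "suggest" : "ignore"; // wait for an explicit prompt
  }
}
```

Under this sketch, the same suggestion flows through different user-facing behaviors depending solely on the configured mode, which is what makes the proactivity level user-customizable.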
Supplemental Material
- Xingyu "Bruce" Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Alex Olwal, Peggy Chi, Xiang "Anthony" Chen, and Ruofei Du. 2023. Visual Captions: Augmenting Verbal Communication with On-the-Fly Visuals. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 108, 20 pages. https://doi.org/10.1145/3544548.3581566
Recommendations
Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals
CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems
Video conferencing solutions like Zoom, Google Meet, and Microsoft Teams are becoming increasingly popular for facilitating conversations, and recent advancements such as live captioning help people better understand each other. We believe that the ...
Saliency in Augmented Reality
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
With the rapid development of multimedia technology, Augmented Reality (AR) has become a promising next-generation mobile platform. The primary theory underlying AR is human visual confusion, which allows users to perceive the real-world scenes and ...
Subtle cueing for visual search in augmented reality
ISMAR '12: Proceedings of the 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
Visual search in augmented reality environments is an important task that can be facilitated through different cueing methods. Current cueing methods rely on explicit cueing, which can potentially reduce visual search performance. In comparison, this ...