ABSTRACT
We demonstrate Visual Captions, a real-time system that integrates with a video conferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest visuals relevant to the context of the ongoing conversation. We implemented Visual Captions as a user-customizable Chrome plugin with three levels of AI proactivity: Auto-display (AI autonomously adds visuals), Auto-suggest (AI proactively recommends visuals), and On-demand-suggest (AI suggests visuals when prompted). We showcase the use of Visual Captions in open-vocabulary settings, and show how adding visuals based on the conversational context can improve comprehension of complex or unfamiliar concepts. In addition, we demonstrate three ways people can interact with the system, one for each level of AI proactivity. Visual Captions is open-sourced at https://github.com/google/archat.
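The three proactivity levels described above can be sketched as a simple dispatch over an LLM-produced suggestion. This is an illustrative sketch only, not the actual archat API: the names `Proactivity`, `Suggestion`, and `nextAction` are assumptions made for this example.

```typescript
// Illustrative sketch of the three AI-proactivity levels (hypothetical names,
// not taken from the open-source archat codebase).

enum Proactivity {
  AutoDisplay = "auto-display",   // AI autonomously adds visuals
  AutoSuggest = "auto-suggest",   // AI proactively recommends visuals
  OnDemandSuggest = "on-demand",  // AI suggests visuals only when prompted
}

interface Suggestion {
  query: string;   // e.g. a visual intent extracted from the conversation
  source: string;  // e.g. an image-search backend
}

// Decide what to do with a suggestion under each proactivity mode.
function nextAction(
  mode: Proactivity,
  suggestion: Suggestion,
  userRequested: boolean
): "display" | "suggest" | "ignore" {
  switch (mode) {
    case Proactivity.AutoDisplay:
      return "display"; // show the visual without asking
    case Proactivity.AutoSuggest:
      return "suggest"; // surface a recommendation for user approval
    case Proactivity.OnDemandSuggest:
      return userRequested ? "suggest" : "ignore"; // wait for an explicit prompt
  }
}
```

Under this sketch, the same suggestion flows through different user-facing behaviors depending solely on the configured mode, which is what makes the proactivity level user-customizable.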
Supplemental Material
- Xingyu "Bruce" Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Alex Olwal, Peggy Chi, Xiang "Anthony" Chen, and Ruofei Du. 2023. Visual Captions: Augmenting Verbal Communication with On-the-Fly Visuals. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 108, 20 pages. https://doi.org/10.1145/3544548.3581566
Recommendations
Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals
CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems
Video conferencing solutions like Zoom, Google Meet, and Microsoft Teams are becoming increasingly popular for facilitating conversations, and recent advancements such as live captioning help people better understand each other. We believe that the ...
Saliency in Augmented Reality
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
With the rapid development of multimedia technology, Augmented Reality (AR) has become a promising next-generation mobile platform. The primary theory underlying AR is human visual confusion, which allows users to perceive the real-world scenes and ...
Subtle cueing for visual search in augmented reality
ISMAR '12: Proceedings of the 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
Visual search in augmented reality environments is an important task that can be facilitated through different cueing methods. Current cueing methods rely on explicit cueing, which can potentially reduce visual search performance. In comparison, this ...