DOI: 10.1145/3686215.3690153

Levels of Multimodal Interaction

Published: 04 November 2024

Abstract

Large Multimodal Models (LMMs) such as OpenAI's GPT-4o and Google's Gemini, introduced in 2024, process multiple modalities, enabling significant advances in multimodal interaction. Inspired by the levels frameworks for self-driving cars and AGI, this paper proposes "Levels of Multimodal Interaction" to guide research and development. The four levels are: basic multimodality (0), single modalities used in turn-taking; combined multimodality (1), fused interpretation of multiple modalities; humanlike (2), natural interaction flow with additional communication signals; and beyond humanlike (3), surpassing human capabilities and incorporating underlying hidden signals, with the potential for transformational human-AI integration. LMMs have progressed from Level 0 to Level 1, with Level 2 next.
Level 3 sets a speculative target that multimodal interaction research could help achieve, where interaction becomes more natural and ultimately surpasses human capabilities. Eventually, such Level 3 multimodal interaction could lead to greater human-AI integration and transform human performance. This anticipated shift, in turn, raises important considerations, particularly around the safety, agency, and control of AI systems.
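
To make the ordering of the taxonomy concrete, the four levels can be encoded as an ordered enumeration. The following is a minimal illustrative sketch in Python, not code from the paper; the type and member names are hypothetical, with each level's description paraphrased from the abstract.

```python
from enum import IntEnum

class MultimodalLevel(IntEnum):
    """Hypothetical encoding of the paper's four proposed levels."""
    BASIC = 0             # Level 0: single modalities used in turn-taking
    COMBINED = 1          # Level 1: fused interpretation of multiple modalities
    HUMANLIKE = 2         # Level 2: natural interaction flow with added signals
    BEYOND_HUMANLIKE = 3  # Level 3: surpasses human capabilities, hidden signals

# Per the abstract, current LMMs have reached Level 1, with Level 2 next.
current = MultimodalLevel.COMBINED
print(f"Current: Level {int(current)} ({current.name})")
assert current < MultimodalLevel.HUMANLIKE  # IntEnum members order naturally
```

Using IntEnum keeps the levels comparable and ordered, matching the abstract's framing of progression from Level 0 toward Level 3.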



Published In

ICMI Companion '24: Companion Proceedings of the 26th International Conference on Multimodal Interaction
November 2024
252 pages
ISBN: 9798400704635
DOI: 10.1145/3686215


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Large Language Models (LLMs)
  2. Large Multimodal Models (LMMs)
  3. Multimodal Interaction

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ICMI '24: International Conference on Multimodal Interaction
November 4 - 8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)
