MIRACLE: An Online, Explainable Multimodal Interactive Concept Learning System

Published: 28 October 2024
DOI: 10.1145/3664647.3684993

Abstract

We present MIRACLE, a system for online, interpretable visual concept and video action recognition. Through a chat interface, users query the recognition system with an uploaded image or video. For images, MIRACLE returns concept predictions from its structured knowledge base, justifying its predictions with heatmaps and natural-language attribute detections. For videos, MIRACLE predicts an action and justifies its prediction with time-varying entity-entity relations. With its ability to learn new concepts in an online, few-shot manner and its support for dynamic changes to its knowledge base, MIRACLE represents a step forward in interpretable multimodal learning systems.
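The abstract leaves the recognition mechanics unspecified, so the following is only a minimal sketch of how an online, few-shot concept store with a dynamic knowledge base could work, assuming a prototype-style classifier over frozen image embeddings. `ConceptStore`, `embed_fn`, and every method name below are hypothetical illustrations, not MIRACLE's actual implementation.

```python
# Hypothetical sketch; the paper does not publish this interface.
from typing import Callable, Dict, List, Tuple
import numpy as np

class ConceptStore:
    """A dynamic knowledge base mapping concept names to prototype vectors."""

    def __init__(self, embed_fn: Callable[[np.ndarray], np.ndarray]):
        # embed_fn stands in for a frozen image encoder (an assumption,
        # not something the abstract specifies).
        self.embed_fn = embed_fn
        self.prototypes: Dict[str, np.ndarray] = {}

    def learn_concept(self, name: str, images: List[np.ndarray]) -> None:
        """Few-shot, online learning: average the support-image embeddings."""
        embs = np.stack([self.embed_fn(img) for img in images])
        proto = embs.mean(axis=0)
        self.prototypes[name] = proto / np.linalg.norm(proto)

    def forget_concept(self, name: str) -> None:
        """Dynamic knowledge-base edit: drop a concept without retraining."""
        self.prototypes.pop(name, None)

    def predict(self, image: np.ndarray) -> List[Tuple[str, float]]:
        """Rank all known concepts by cosine similarity to the query image."""
        q = self.embed_fn(image)
        q = q / np.linalg.norm(q)
        scores = {n: float(q @ p) for n, p in self.prototypes.items()}
        return sorted(scores.items(), key=lambda kv: -kv[1])
```

Under this reading, "online" learning reduces to inserting or deleting prototype rows, which is what would make few-shot additions and knowledge-base edits cheap; the heatmap and attribute justifications the abstract describes would sit on top of whatever encoder `embed_fn` wraps.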

Supplemental Material

MP4 File - Presentation Video
Demonstration of MIRACLE: An Online, Explainable Multimodal Interactive Concept Learning System

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. few-shot learning
  2. interpretability
  3. multimodal interaction
  4. object recognition
  5. online learning
  6. video action detection

Qualifiers

  • Abstract

Funding Sources

  • DARPA

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
