MIRACLE: An Online, Explainable Multimodal Interactive Concept Learning System

Published: 28 October 2024
DOI: 10.1145/3664647.3684993

Abstract

We present MIRACLE, a system for online, interpretable visual concept and video action recognition. Through a chat interface, users query the recognition system with an uploaded image or video. For images, MIRACLE returns concept predictions from its structured knowledge base, justifying its predictions with heatmaps and natural-language attribute detections. For videos, MIRACLE predicts an action and justifies its prediction with time-varying entity-entity relations. With its ability to learn new concepts in an online, few-shot manner and its support for dynamic changes to its knowledge base, MIRACLE represents a step forward in interpretable multimodal learning systems.
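The abstract leaves the recognition mechanics unspecified, so the following is only a minimal sketch of how an online, few-shot concept store with a dynamic knowledge base could work, assuming a prototype-style classifier over frozen image embeddings. `ConceptStore`, `embed_fn`, and every method name below are hypothetical illustrations, not MIRACLE's actual implementation.

```python
# Hypothetical sketch; the paper does not publish this interface.
from typing import Callable, Dict, List, Tuple
import numpy as np

class ConceptStore:
    """A dynamic knowledge base mapping concept names to prototype vectors."""

    def __init__(self, embed_fn: Callable[[np.ndarray], np.ndarray]):
        # embed_fn stands in for a frozen image encoder (an assumption,
        # not something the abstract specifies).
        self.embed_fn = embed_fn
        self.prototypes: Dict[str, np.ndarray] = {}

    def learn_concept(self, name: str, images: List[np.ndarray]) -> None:
        """Few-shot, online learning: average the support-image embeddings."""
        embs = np.stack([self.embed_fn(img) for img in images])
        proto = embs.mean(axis=0)
        self.prototypes[name] = proto / np.linalg.norm(proto)

    def forget_concept(self, name: str) -> None:
        """Dynamic knowledge-base edit: drop a concept without retraining."""
        self.prototypes.pop(name, None)

    def predict(self, image: np.ndarray) -> List[Tuple[str, float]]:
        """Rank all known concepts by cosine similarity to the query image."""
        q = self.embed_fn(image)
        q = q / np.linalg.norm(q)
        scores = {n: float(q @ p) for n, p in self.prototypes.items()}
        return sorted(scores.items(), key=lambda kv: -kv[1])
```

Under this reading, "online" learning reduces to inserting or deleting prototype rows, which is what would make few-shot additions and knowledge-base edits cheap; the heatmap and attribute justifications the abstract describes would sit on top of whatever encoder `embed_fn` wraps.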

Supplemental Material

MP4 File - Presentation Video
Demonstration of MIRACLE: An Online, Explainable Multimodal Interactive Concept Learning System

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. few-shot learning
  2. interpretability
  3. multimodal interaction
  4. object recognition
  5. online learning
  6. video action detection

Qualifiers

  • Abstract

Funding Sources

  • DARPA

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
