
Generating Explanations for Embodied Action Decision from Visual Observation

Published: 27 October 2023

ABSTRACT

Earning trust is crucial for embodied agents (such as robots and autonomous vehicles) to collaborate with human beings, especially non-experts. The most direct route to mutual understanding is natural language explanation. Existing research considers generating visual explanations for object recognition, while explaining embodied decisions remains unexplored. In this paper, we study generating action decisions and explanations from visual observation. Unlike explanations for recognition, justifying an action requires showing why it is better than the alternative actions. In addition, an understanding of scene structure is required, since the agent needs to interact with the environment (e.g., navigation, moving objects). We introduce a new dataset, THOR-EAE (Embodied Action Explanation), collected with the AI2-THOR simulator. The dataset consists of over 840,000 egocentric images of indoor embodied observations, annotated with optimal action labels and explanation sentences. An explainable decision-making criterion that considers scene layout and action attributes is developed for efficient annotation. We propose a graph action justification model that exploits graph neural networks to represent obstacle-surroundings relations and justifies actions under the guidance of the decision results. Experimental results on the THOR-EAE dataset showcase its challenges and the effectiveness of the proposed method.
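The general idea described in the abstract can be sketched in code. Below is a minimal, hypothetical PyTorch illustration (not the authors' released implementation): detected obstacles and surrounding objects form graph nodes, a small graph neural network encodes their relations, an action head predicts the decision, and an explanation decoder is conditioned on that decision. All class and parameter names (GCNLayer, GraphJustifier, feat_dim, n_actions, etc.) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbour features with a
    row-normalized adjacency matrix, then apply a learned linear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features, adj: (N, N) adjacency with self-loops
        return torch.relu(self.linear(adj @ x))

class GraphJustifier(nn.Module):
    """Hypothetical sketch: encode obstacle-surroundings relations with a
    two-layer GCN, predict an action, and condition a GRU explanation
    decoder on the predicted decision."""
    def __init__(self, feat_dim=512, hid_dim=256, n_actions=6, vocab_size=1000):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, hid_dim)
        self.action_head = nn.Linear(hid_dim, n_actions)
        self.action_embed = nn.Embedding(n_actions, hid_dim)
        self.word_embed = nn.Embedding(vocab_size, hid_dim)
        self.decoder = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.word_head = nn.Linear(hid_dim, vocab_size)

    def forward(self, node_feats, adj, tokens):
        # node_feats: (N, feat_dim) features of detected objects/obstacles
        # adj: (N, N) relation graph, tokens: (T,) explanation word ids
        h = self.gcn2(self.gcn1(node_feats, adj), adj)   # (N, hid_dim)
        scene = h.mean(dim=0)                            # graph-level summary
        action_logits = self.action_head(scene)          # (n_actions,)
        action = action_logits.argmax(dim=-1)
        # Decision-guided decoding: the initial hidden state mixes the scene
        # summary with the embedding of the chosen action.
        h0 = (scene + self.action_embed(action)).view(1, 1, -1)
        out, _ = self.decoder(self.word_embed(tokens).unsqueeze(0), h0)
        word_logits = self.word_head(out)                # (1, T, vocab_size)
        return action_logits, word_logits

# Toy usage: random features for 5 scene objects and a 7-token explanation.
model = GraphJustifier()
feats = torch.randn(5, 512)
adj = torch.eye(5)                      # trivial graph: self-loops only
tokens = torch.randint(0, 1000, (7,))
action_logits, word_logits = model(feats, adj, tokens)
```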


Published in

          MM '23: Proceedings of the 31st ACM International Conference on Multimedia
          October 2023
          9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

          Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



          Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
