DOI: 10.1145/3503161.3548170
Research article

Visual Dialog for Spotting the Differences between Pairs of Similar Images

Published: 10 October 2022

ABSTRACT

Visual dialog has seen great progress since various vision-oriented goals were introduced into the conversation. Most previous work focuses on tasks in which both interlocutors access a single shared image, such as VisDial and GuessWhat. Situations in which the two interlocutors see different images have received far less attention, even though they are common in the real world and pose challenges distinct from those of single-image tasks. The lack of such dialog tasks and of corresponding large-scale datasets has so far prevented in-depth research. This paper therefore first proposes a new visual dialog task named Dial-the-Diff, in which two interlocutors, each with access to one of two similar images, try to spot the difference between the images by conversing in natural language. The task raises new challenges for dialog strategy and for the ability to categorize objects. We then build a large-scale multi-modal dataset for the task, named DialDiff, which contains 87k Virtual Reality images and 78k dialogs. We describe and analyze the data to highlight the challenges behind the task. Finally, we propose benchmark models for the task and conduct extensive experiments to evaluate both their performance and the problems that remain.
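The abstract describes the task only in words; as a rough illustration, the sketch below frames a single Dial-the-Diff episode as a two-agent loop in which each agent sees only its own image and communicates purely in natural language. It is a minimal sketch under assumed names: DialDiffExample, spot_the_difference, and the agents' ask/reply/guess_difference interface are hypothetical and are not taken from the paper or the DialDiff release.

```python
"""Hypothetical sketch of one Dial-the-Diff episode as described in the abstract.

All names here (DialDiffExample, spot_the_difference, ask/reply/guess_difference,
max_turns) are illustrative assumptions, not the paper's dataset schema or code.
"""
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialDiffExample:
    """One episode: two similar VR-rendered images plus the ground-truth difference."""
    image_a_path: str                                # image seen only by the questioner
    image_b_path: str                                # image seen only by the answerer
    difference_category: str                         # object category that differs (ground truth)
    dialog: List[str] = field(default_factory=list)  # alternating utterances (Q, A, Q, A, ...)


def spot_the_difference(questioner, answerer, example: DialDiffExample,
                        max_turns: int = 10) -> Optional[str]:
    """Run a dialog in which each agent sees only its own image.

    The questioner asks about its image, the answerer replies based on the other
    image, and after every exchange the questioner may guess the differing object.
    Returns the guessed category, or None if no guess is made within the budget.
    """
    history: List[str] = []
    for _ in range(max_turns):
        question = questioner.ask(example.image_a_path, history)
        answer = answerer.reply(example.image_b_path, question, history)
        history.extend([question, answer])
        example.dialog = list(history)

        guess = questioner.guess_difference(example.image_a_path, history)
        if guess is not None:          # questioner is confident enough to stop
            return guess
    return None                        # turn budget exhausted without a guess
```

The point the sketch tries to capture is the information asymmetry the task imposes: neither agent ever receives the other's image, so the difference can only be located through the dialog itself, which is what stresses dialog strategy and object categorization.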


Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States