DOI: 10.1145/3503161.3548170
Research article

Visual Dialog for Spotting the Differences between Pairs of Similar Images

Published: 10 October 2022

ABSTRACT

Visual dialog has seen great progress since various vision-oriented goals were introduced into the conversation. Most previous work focuses on tasks in which both interlocutors access a single shared image, such as VisDial and GuessWhat. Situations in which the two interlocutors see different images have received far less attention, even though they are common in the real world and pose challenges distinct from those of single-image tasks. The lack of such dialog tasks and of corresponding large-scale datasets has so far prevented in-depth research. This paper therefore first proposes a new visual dialog task named Dial-the-Diff, in which two interlocutors, each with access to one of two similar images, try to spot the difference between the images by conversing in natural language. The task raises new challenges for dialog strategy and for the ability to categorize objects. We then build a large-scale multi-modal dataset for the task, named DialDiff, which contains 87k Virtual Reality images and 78k dialogs. We describe and analyze the data to highlight the challenges behind the task. Finally, we propose benchmark models for the task and conduct extensive experiments to evaluate both their performance and the problems that remain.
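The abstract describes the task only in words; as a rough illustration, the sketch below frames a single Dial-the-Diff episode as a two-agent loop in which each agent sees only its own image and communicates purely in natural language. It is a minimal sketch under assumed names: DialDiffExample, spot_the_difference, and the agents' ask/reply/guess_difference interface are hypothetical and are not taken from the paper or the DialDiff release.

```python
"""Hypothetical sketch of one Dial-the-Diff episode as described in the abstract.

All names here (DialDiffExample, spot_the_difference, ask/reply/guess_difference,
max_turns) are illustrative assumptions, not the paper's dataset schema or code.
"""
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialDiffExample:
    """One episode: two similar VR-rendered images plus the ground-truth difference."""
    image_a_path: str                                # image seen only by the questioner
    image_b_path: str                                # image seen only by the answerer
    difference_category: str                         # object category that differs (ground truth)
    dialog: List[str] = field(default_factory=list)  # alternating utterances (Q, A, Q, A, ...)


def spot_the_difference(questioner, answerer, example: DialDiffExample,
                        max_turns: int = 10) -> Optional[str]:
    """Run a dialog in which each agent sees only its own image.

    The questioner asks about its image, the answerer replies based on the other
    image, and after every exchange the questioner may guess the differing object.
    Returns the guessed category, or None if no guess is made within the budget.
    """
    history: List[str] = []
    for _ in range(max_turns):
        question = questioner.ask(example.image_a_path, history)
        answer = answerer.reply(example.image_b_path, question, history)
        history.extend([question, answer])
        example.dialog = list(history)

        guess = questioner.guess_difference(example.image_a_path, history)
        if guess is not None:          # questioner is confident enough to stop
            return guess
    return None                        # turn budget exhausted without a guess
```

The point the sketch tries to capture is the information asymmetry the task imposes: neither agent ever receives the other's image, so the difference can only be located through the dialog itself, which is what stresses dialog strategy and object categorization.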


Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States