DOI: 10.1145/3581783.3611721
research-article

Suspected Objects Matter: Rethinking Model's Prediction for One-stage Visual Grounding

Published: 27 October 2023

Abstract

Recently, one-stage visual grounders have attracted considerable attention because they achieve accuracy comparable to two-stage grounders while being significantly more efficient. However, inter-object relation modeling has not been well studied for one-stage grounders. Such modeling, though important, need not be performed among all objects: only a subset of them is related to the text query and liable to confuse the model. We call these objects "suspected objects". Exploring their relationships in the one-stage paradigm is non-trivial because: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) suspected objects are more confusing than others, since they may share similar semantics or be entangled with certain relationships, and can thereby more easily mislead the model's prediction. To this end, we propose a Suspected Object Transformation (SOT) mechanism, which can be seamlessly integrated into existing CNN- and Transformer-based one-stage visual grounders to encourage target object selection among the suspected ones. Suspected objects are dynamically discovered from a learned activation map that adapts to the model's current discrimination ability during training. On top of the suspected objects, a Keyword-Aware Discrimination (KAD) module and an Exploration by Random Connection (ERC) strategy are then proposed to help the model rethink its initial prediction. KAD leverages the keywords that contribute most to discriminating among suspected objects, while ERC lets the model seek the correct object rather than being trapped into always exploiting its current false prediction. Extensive experiments demonstrate the effectiveness of the proposed method.
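To make the abstract's selection-and-rethinking idea concrete, below is a minimal, hypothetical sketch, not the authors' released code: it picks "suspected objects" as the top-k locations of an activation map (mirroring the dynamic discovery described above) and uses an epsilon-greedy re-selection loosely analogous to ERC's exploration of alternatives to a possibly false prediction. The function names, the top-k rule, and the epsilon parameter are all illustrative assumptions.

```python
import torch


def discover_suspected_objects(activation_map: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Pick the k most-activated spatial locations as 'suspected objects'.

    activation_map: (B, H, W) per-location confidence scores from a
                    one-stage grounding head. Because the map is learned,
    the selected set adapts to the model's current discrimination ability.
    Returns indices of shape (B, k) into the flattened H*W grid.
    """
    b, h, w = activation_map.shape
    flat = activation_map.view(b, h * w)
    _, topk_idx = flat.topk(k, dim=-1)  # most confident (and most confusable) spots
    return topk_idx


def exploration_by_random_connection(scores: torch.Tensor, epsilon: float = 0.1) -> torch.Tensor:
    """Epsilon-greedy re-selection among suspected objects.

    scores: (B, k) discrimination scores over the suspected objects.
    With probability epsilon, a random suspected object is chosen instead of
    the current argmax, so training is not trapped into always exploiting a
    false prediction (a rough analogue of the paper's ERC strategy).
    """
    b, k = scores.shape
    greedy = scores.argmax(dim=-1)
    random_pick = torch.randint(0, k, (b,))
    explore = torch.rand(b) < epsilon
    return torch.where(explore, random_pick, greedy)


if __name__ == "__main__":
    amap = torch.rand(2, 20, 20)                       # toy activation map
    idx = discover_suspected_objects(amap, k=5)        # (2, 5) grid indices
    suspected_scores = amap.view(2, -1).gather(1, idx)  # scores of suspected objects
    chosen = exploration_by_random_connection(suspected_scores)
    print(idx.shape, chosen.shape)                     # torch.Size([2, 5]) torch.Size([2])
```

Note that the real KAD module additionally conditions this discrimination step on query keywords; the sketch omits the language side entirely and only illustrates the select-then-rethink control flow.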




Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. one-stage paradigm
    2. suspected objects
    3. visual grounding

    Qualifiers

    • Research-article

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

• Downloads (last 12 months): 210
• Downloads (last 6 weeks): 6

Reflects downloads up to 05 Mar 2025.

Cited By

• (2024) CSRef: Contrastive Semantic Alignment for Speech Referring Expression Comprehension. Proceedings of the 2nd International Workshop on Methodologies for Multimedia, 28-34. https://doi.org/10.1145/3689089.3689706. Online publication date: 28-Oct-2024.
• (2023) Hierarchical cross-modal contextual attention network for visual grounding. Multimedia Systems, 29(4), 2073-2083. https://doi.org/10.1007/s00530-023-01097-8. Online publication date: 17-Apr-2023.
• (2022) CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval. Computer Vision - ECCV 2022, 700-716. https://doi.org/10.1007/978-3-031-20059-5_40. Online publication date: 23-Oct-2022.
• (2022) MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes. Computer Vision - ECCV 2022, 528-545. https://doi.org/10.1007/978-3-031-19833-5_31. Online publication date: 23-Oct-2022.
