Research article · DOI: 10.1145/3503161.3547922

Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval

Published: 10 October 2022

Abstract

Cross-modal retrieval has been a compelling topic in the multimodal community. Recently, to mitigate the high cost of data collection, co-occurring pairs (e.g., image and text) have been harvested from the Internet to build large-scale cross-modal datasets, e.g., Conceptual Captions. However, this practice inevitably introduces noise (i.e., mismatched pairs) into the training data, dubbed noisy correspondence. Such noise makes the supervision unreliable/uncertain and remarkably degrades performance. Moreover, most existing methods focus training on hard negatives, which amplifies the unreliability caused by the noise. To address these issues, we propose a generalized Deep Evidential Cross-modal Learning framework (DECL), which integrates a novel Cross-modal Evidential Learning paradigm (CEL) and a Robust Dynamic Hinge loss (RDH) with positive and negative learning. CEL captures and learns the uncertainty brought by the noise to improve the robustness and reliability of cross-modal retrieval. Specifically, the bidirectional evidence based on cross-modal similarity is first modeled and parameterized into a Dirichlet distribution, which not only provides accurate uncertainty estimation but also imparts resilience to perturbations from noisy correspondence. To address the amplification problem, RDH smoothly increases the hardness of the negatives it focuses on, thus achieving higher robustness under heavy noise. Extensive experiments are conducted on three image-text benchmark datasets, i.e., Flickr30K, MS-COCO, and Conceptual Captions, to verify the effectiveness and efficiency of the proposed method. The code is available at https://github.com/QinYang79/DECL.
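As background for the evidential component described above, the following minimal sketch (not the authors' exact implementation; the function and variable names are illustrative) shows the standard evidential-deep-learning step of turning non-negative evidence scores into Dirichlet parameters and a subjective-logic uncertainty:

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Map non-negative evidence over K candidates (e.g., similarity-derived
    evidence for K retrieval candidates) to Dirichlet parameters, belief
    masses, and a subjective-logic uncertainty ("vacuity") in (0, 1]."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)  # total Dirichlet strength S
    belief = evidence / strength                  # per-candidate belief masses
    k = evidence.shape[-1]
    uncertainty = k / strength                    # high when evidence is scarce
    return alpha, belief, uncertainty

# No evidence at all -> maximal uncertainty (1.0);
# strong one-sided evidence -> low uncertainty.
_, _, u_none = dirichlet_uncertainty([0.0, 0.0])
_, b_some, u_some = dirichlet_uncertainty([8.0, 0.0])
```

Here `u_none` evaluates to 1.0 and `u_some` to 0.2: noisy (mismatched) pairs that generate little consistent evidence end up with high uncertainty, which is the signal DECL exploits.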

Supplementary Material

MP4 File (MM22-fp0732.mp4)
The paper studies a challenging paradigm of noisy labels, i.e., noisy correspondence in cross-modal retrieval, in which mismatched pairs enter the training data and degrade performance. To address this problem, we present a generalized Deep Evidential Cross-modal Learning framework (DECL) that captures the uncertainty of the noise with CEL and resists noisy perturbations with the proposed RDH, thus achieving robustness against noisy correspondence. Specifically, CEL is the proposed Cross-modal Evidential Learning paradigm, which captures the uncertainty brought by noisy correspondence with the help of evidential learning. The RDH loss improves the robustness of the hinge loss to noisy correspondence by gradually increasing the hardness of the negative pairs it focuses on.
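To make the "gradually increasing hardness" idea concrete, here is one common way to interpolate between a mean-over-negatives hinge loss and a hardest-negative hinge loss with a temperature; this is only an illustrative sketch under that assumption, not the paper's actual RDH formulation, and all names are hypothetical:

```python
import numpy as np

def hinge_with_hardness(sim, margin=0.2, tau=0.0):
    """Triplet hinge loss over a similarity matrix sim (N x N) whose
    diagonal holds matched (positive) pairs. Negatives are weighted by a
    softmax of temperature tau: tau = 0 averages all negatives (soft,
    noise-tolerant); large tau concentrates on the hardest negative."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    losses = []
    for i in range(n):
        neg = np.delete(sim[i], i)                        # off-diagonal: negatives
        viol = np.maximum(0.0, margin - sim[i, i] + neg)  # per-negative hinge
        w = np.exp(tau * neg)
        w /= w.sum()                                      # hardness weights
        losses.append(float((w * viol).sum()))
    return float(np.mean(losses))

sim = [[0.9, 0.6, 0.2],
       [0.1, 0.8, 0.5],
       [0.3, 0.4, 0.7]]
soft = hinge_with_hardness(sim, margin=0.5, tau=0.0)   # mean-like weighting
hard = hinge_with_hardness(sim, margin=0.5, tau=50.0)  # ~hardest-negative loss
```

Raising `tau` over training would smoothly shift the focus toward harder negatives, which mirrors the robustness-vs-discriminativeness trade-off RDH is designed to manage.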


Cited By

  • (2025) Contrastive Dual-Pool Feature Adaption for Domain Incremental Remote Sensing Scene Classification. Remote Sensing 17(2): 308. DOI: 10.3390/rs17020308. 16 January 2025.
  • (2025) UA-FER: Uncertainty-aware representation learning for facial expression recognition. Neurocomputing 621: 129261. DOI: 10.1016/j.neucom.2024.129261. March 2025.
  • (2025) Multi-level semantics probability embedding for image–text matching. Information Processing & Management 62(2): 103968. DOI: 10.1016/j.ipm.2024.103968. March 2025.
  • (2025) Uncertainty-aware evidential learning for legal case retrieval with noisy correspondence. Information Sciences: 121915. DOI: 10.1016/j.ins.2025.121915. January 2025.
  • (2024) Enhancing cross-modal retrieval via visual-textual prompt hashing. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 623–631. DOI: 10.24963/ijcai.2024/69. 3 August 2024.
  • (2024) Trusted multi-view learning with label noise. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 5263–5271. DOI: 10.24963/ijcai.2024/582. 3 August 2024.
  • (2024) Assess and Guide: Multi-modal Fake News Detection via Decision Uncertainty. Proceedings of the 1st ACM Multimedia Workshop on Multi-modal Misinformation Governance in the Era of Foundation Models, 37–44. DOI: 10.1145/3689090.3689389. 28 October 2024.
  • (2024) CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 5181–5190. DOI: 10.1145/3664647.3681629. 28 October 2024.
  • (2024) Dynamic Evidence Decoupling for Trusted Multi-view Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 7269–7277. DOI: 10.1145/3664647.3681404. 28 October 2024.
  • (2024) PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 9397–9406. DOI: 10.1145/3664647.3680860. 28 October 2024.

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cross-modal retrieval
    2. evidential learning
    3. image-text matching
    4. noisy correspondence


    Funding Sources

    • Chengdu Science and Technology Project
    • Scu&Zigong Cooperation Project
    • the National Natural Science Foundation of China
    • China Postdoctoral Science Foundation
    • Sichuan Science and Technology Planning Project
    • Open Research Projects of Zhejiang Lab

    Conference

    MM '22

    Acceptance Rates

    Overall acceptance rate: 2,145 of 8,556 submissions (25%)

    Article Metrics

    • Downloads (last 12 months): 283
    • Downloads (last 6 weeks): 35

    Reflects downloads up to 28 Feb 2025

