Research article · DOI: 10.1145/3503161.3547922

Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval

Published: 10 October 2022

Abstract

Cross-modal retrieval has been a compelling topic in the multimodal community. Recently, to mitigate the high cost of data collection, co-occurring pairs (e.g., image and text) have been harvested from the Internet to build large-scale cross-modal datasets, e.g., Conceptual Captions. However, this practice inevitably introduces noise (i.e., mismatched pairs) into the training data, dubbed noisy correspondence. Such noise makes the supervision unreliable/uncertain and remarkably degrades performance. Moreover, most existing methods focus training on hard negatives, which amplifies the unreliability caused by the noise. To address these issues, we propose a generalized Deep Evidential Cross-modal Learning framework (DECL), which integrates a novel Cross-modal Evidential Learning paradigm (CEL) and a Robust Dynamic Hinge loss (RDH) with positive and negative learning. CEL captures and learns the uncertainty brought by the noise to improve the robustness and reliability of cross-modal retrieval. Specifically, the bidirectional evidence based on cross-modal similarity is first modeled and parameterized into a Dirichlet distribution, which not only provides accurate uncertainty estimation but also imparts resilience to perturbations from noisy correspondence. To address the amplification problem, RDH smoothly increases the hardness of the negatives it focuses on, thus achieving higher robustness under heavy noise. Extensive experiments are conducted on three image-text benchmark datasets, i.e., Flickr30K, MS-COCO, and Conceptual Captions, to verify the effectiveness and efficiency of the proposed method. The code is available at https://github.com/QinYang79/DECL.
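As background for the evidential component described above, the following minimal sketch (not the authors' exact implementation; the function and variable names are illustrative) shows the standard evidential-deep-learning step of turning non-negative evidence scores into Dirichlet parameters and a subjective-logic uncertainty:

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Map non-negative evidence over K candidates (e.g., similarity-derived
    evidence for K retrieval candidates) to Dirichlet parameters, belief
    masses, and a subjective-logic uncertainty ("vacuity") in (0, 1]."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)  # total Dirichlet strength S
    belief = evidence / strength                  # per-candidate belief masses
    k = evidence.shape[-1]
    uncertainty = k / strength                    # high when evidence is scarce
    return alpha, belief, uncertainty

# No evidence at all -> maximal uncertainty (1.0);
# strong one-sided evidence -> low uncertainty.
_, _, u_none = dirichlet_uncertainty([0.0, 0.0])
_, b_some, u_some = dirichlet_uncertainty([8.0, 0.0])
```

Here `u_none` evaluates to 1.0 and `u_some` to 0.2: noisy (mismatched) pairs that generate little consistent evidence end up with high uncertainty, which is the signal DECL exploits.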

Supplementary Material

MP4 File (MM22-fp0732.mp4)
The paper studies a challenging paradigm of noisy labels, i.e., noisy correspondence in cross-modal retrieval, in which mismatched pairs enter the training data and degrade performance. To address this problem, we present a generalized Deep Evidential Cross-modal Learning framework (DECL) that captures the uncertainty of the noise with CEL and resists noisy perturbations with the proposed RDH, thus achieving robustness against noisy correspondence. Specifically, CEL is the proposed Cross-modal Evidential Learning paradigm, which captures the uncertainty brought by noisy correspondence with the help of evidential learning. The RDH loss improves the robustness of the hinge loss to noisy correspondence by gradually increasing the hardness of the negative pairs it focuses on.
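To make the "gradually increasing hardness" idea concrete, here is one common way to interpolate between a mean-over-negatives hinge loss and a hardest-negative hinge loss with a temperature; this is only an illustrative sketch under that assumption, not the paper's actual RDH formulation, and all names are hypothetical:

```python
import numpy as np

def hinge_with_hardness(sim, margin=0.2, tau=0.0):
    """Triplet hinge loss over a similarity matrix sim (N x N) whose
    diagonal holds matched (positive) pairs. Negatives are weighted by a
    softmax of temperature tau: tau = 0 averages all negatives (soft,
    noise-tolerant); large tau concentrates on the hardest negative."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    losses = []
    for i in range(n):
        neg = np.delete(sim[i], i)                        # off-diagonal: negatives
        viol = np.maximum(0.0, margin - sim[i, i] + neg)  # per-negative hinge
        w = np.exp(tau * neg)
        w /= w.sum()                                      # hardness weights
        losses.append(float((w * viol).sum()))
    return float(np.mean(losses))

sim = [[0.9, 0.6, 0.2],
       [0.1, 0.8, 0.5],
       [0.3, 0.4, 0.7]]
soft = hinge_with_hardness(sim, margin=0.5, tau=0.0)   # mean-like weighting
hard = hinge_with_hardness(sim, margin=0.5, tau=50.0)  # ~hardest-negative loss
```

Raising `tau` over training would smoothly shift the focus toward harder negatives, which mirrors the robustness-vs-discriminativeness trade-off RDH is designed to manage.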


Cited By

  • (2025) Contrastive Dual-Pool Feature Adaption for Domain Incremental Remote Sensing Scene Classification. Remote Sensing 17(2): 308. DOI: 10.3390/rs17020308. 16 January 2025.
  • (2025) UA-FER: Uncertainty-aware representation learning for facial expression recognition. Neurocomputing 621: 129261. DOI: 10.1016/j.neucom.2024.129261. March 2025.
  • (2025) Multi-level semantics probability embedding for image–text matching. Information Processing & Management 62(2): 103968. DOI: 10.1016/j.ipm.2024.103968. March 2025.
  • (2025) Uncertainty-aware evidential learning for legal case retrieval with noisy correspondence. Information Sciences: 121915. DOI: 10.1016/j.ins.2025.121915. January 2025.
  • (2024) Enhancing cross-modal retrieval via visual-textual prompt hashing. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 623–631. DOI: 10.24963/ijcai.2024/69. 3 August 2024.
  • (2024) Trusted multi-view learning with label noise. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 5263–5271. DOI: 10.24963/ijcai.2024/582. 3 August 2024.
  • (2024) Assess and Guide: Multi-modal Fake News Detection via Decision Uncertainty. Proceedings of the 1st ACM Multimedia Workshop on Multi-modal Misinformation Governance in the Era of Foundation Models, 37–44. DOI: 10.1145/3689090.3689389. 28 October 2024.
  • (2024) CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 5181–5190. DOI: 10.1145/3664647.3681629. 28 October 2024.
  • (2024) Dynamic Evidence Decoupling for Trusted Multi-view Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 7269–7277. DOI: 10.1145/3664647.3681404. 28 October 2024.
  • (2024) PC2: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 9397–9406. DOI: 10.1145/3664647.3680860. 28 October 2024.

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cross-modal retrieval
    2. evidential learning
    3. image-text matching
    4. noisy correspondence


    Funding Sources

    • Chengdu Science and Technology Project
    • Scu&Zigong Cooperation Project
    • the National Natural Science Foundation of China
    • China Postdoctoral Science Foundation
    • Sichuan Science and Technology Planning Project
    • Open Research Projects of Zhejiang Lab

    Conference

    MM '22

    Acceptance Rates

    Overall acceptance rate: 2,145 of 8,556 submissions (25%)

    Article Metrics

    • Downloads (last 12 months): 283
    • Downloads (last 6 weeks): 35

    Reflects downloads up to 28 Feb 2025

