Research Article
DOI: 10.1145/3664647.3681421

Selection and Reconstruction of Key Locals: A Novel Specific Domain Image-Text Retrieval Method

Published: 28 October 2024

Abstract

In recent years, Vision-Language Pre-training (VLP) models have demonstrated rich prior knowledge for multimodal alignment, prompting investigations into their application to Specific Domain Image-Text Retrieval (SDITR), such as Text-Image Person Re-identification (TIReID) and Remote Sensing Image-Text Retrieval (RSITR). Owing to the unique data characteristics of these scenarios, the primary challenge is to leverage discriminative fine-grained local information for improved mapping of images and text into a shared space. Current approaches let all multimodal local features interact during alignment and rely only implicitly on discriminative local information to distinguish data differences, which may introduce noise and uncertainty. Furthermore, their VLP feature extractors, such as CLIP, tend to focus on instance-level representations, potentially reducing the discriminability of fine-grained local features. To alleviate these issues, we propose an Explicit Key Local information Selection and Reconstruction Framework (EKLSR), which explicitly selects key local information to enhance feature representation. Specifically, we introduce a Key Local information Selection and Fusion (KLSF) module that exploits hidden knowledge from the VLP model to interpretably select and fuse key local information. In addition, we employ Key Local segment Reconstruction (KLR) based on multimodal interaction to reconstruct the key local segments of images (text), significantly enriching their discriminative information and enhancing both inter-modal and intra-modal interaction alignment. To demonstrate the effectiveness of our approach, we conducted experiments on five datasets across TIReID and RSITR. Notably, our EKLSR model achieves state-of-the-art performance on two RSITR datasets.
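
The abstract gives no implementation details, but the two modules can be pictured roughly as follows. The sketch below is a minimal, hypothetical PyTorch illustration of the two ideas named above: scoring local tokens against the instance-level feature and keeping the top-k as "key locals" (a stand-in for KLSF), and reconstructing masked key-local segments of one modality by cross-attending to the other modality (a stand-in for KLR). All names (select_key_locals, KeyLocalReconstructor), the scoring rule, the mask ratio, and the losses are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch only; not the EKLSR code released by the authors.
import torch
import torch.nn as nn
import torch.nn.functional as F


def select_key_locals(local_feats, cls_feat, k):
    """Score local tokens by cosine similarity to the instance-level feature
    and keep the top-k most relevant ones (hypothetical KLSF-style selection).

    local_feats: (B, N, D) patch/word token features from a CLIP-like encoder
    cls_feat:    (B, D)    instance-level ([CLS]/[EOS]) feature
    """
    scores = torch.einsum('bnd,bd->bn',
                          F.normalize(local_feats, dim=-1),
                          F.normalize(cls_feat, dim=-1))            # (B, N)
    topk = scores.topk(k, dim=1).indices                            # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, local_feats.size(-1))   # (B, k, D)
    return local_feats.gather(1, idx)                               # (B, k, D)


class KeyLocalReconstructor(nn.Module):
    """Reconstruct masked key-local segments of one modality by attending to
    the key locals of the other modality (hypothetical KLR-style objective)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, key_locals, other_locals, mask_ratio=0.3):
        B, K, D = key_locals.shape
        num_mask = max(1, int(K * mask_ratio))
        # Randomly pick positions to mask in each sample.
        mask_idx = torch.rand(B, K, device=key_locals.device).argsort(1)[:, :num_mask]
        gather_idx = mask_idx.unsqueeze(-1).expand(-1, -1, D)
        masked = key_locals.clone()
        masked.scatter_(1, gather_idx, self.mask_token.expand(B, num_mask, D))
        # Queries: masked sequence; keys/values: the other modality's key locals.
        recon, _ = self.cross_attn(masked, other_locals, other_locals)
        recon = self.proj(recon)
        # Reconstruction loss only on the masked positions.
        target = key_locals.gather(1, gather_idx)
        pred = recon.gather(1, gather_idx)
        return F.mse_loss(pred, target)


if __name__ == "__main__":
    B, N, D, K = 4, 196, 512, 16
    img_locals, txt_locals = torch.randn(B, N, D), torch.randn(B, 77, D)
    img_cls, txt_cls = torch.randn(B, D), torch.randn(B, D)
    img_key = select_key_locals(img_locals, img_cls, K)
    txt_key = select_key_locals(txt_locals, txt_cls, K)
    loss = KeyLocalReconstructor(D)(img_key, txt_key)
    print(loss.item())

In practice such a reconstruction loss would be added to the usual image-text matching/contrastive objectives rather than trained alone; the snippet only shows the shape of the computation.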



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. key local information selection and reconstruction
  2. remote sensing
  3. specific domain image-text retrieval
  4. text-image person re-identification

Qualifiers

  • Research-article

Funding Sources

  • The National Key Research and Development Program of China under Grant

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
