Research Article
DOI: 10.1145/3664647.3681421

Selection and Reconstruction of Key Locals: A Novel Specific Domain Image-Text Retrieval Method

Published: 28 October 2024

Abstract

In recent years, Vision-Language Pre-training (VLP) models have demonstrated rich prior knowledge for multimodal alignment, prompting investigations into their application to Specific Domain Image-Text Retrieval (SDITR), such as Text-Image Person Re-identification (TIReID) and Remote Sensing Image-Text Retrieval (RSITR). Owing to the unique data characteristics of these scenarios, the primary challenge is to leverage discriminative fine-grained local information for improved mapping of images and text into a shared space. Current approaches let all multimodal local features interact during alignment and rely only implicitly on discriminative local information to distinguish data differences, which may introduce noise and uncertainty. Furthermore, their VLP feature extractors, such as CLIP, tend to focus on instance-level representations, potentially reducing the discriminability of fine-grained local features. To alleviate these issues, we propose an Explicit Key Local information Selection and Reconstruction Framework (EKLSR), which explicitly selects key local information to enhance feature representation. Specifically, we introduce a Key Local information Selection and Fusion (KLSF) module that exploits hidden knowledge from the VLP model to interpretably select and fuse key local information. In addition, we employ Key Local segment Reconstruction (KLR) based on multimodal interaction to reconstruct the key local segments of images (text), significantly enriching their discriminative information and enhancing both inter-modal and intra-modal interaction alignment. To demonstrate the effectiveness of our approach, we conducted experiments on five datasets across TIReID and RSITR. Notably, our EKLSR model achieves state-of-the-art performance on two RSITR datasets.
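
The abstract gives no implementation details, but the two modules can be pictured roughly as follows. The sketch below is a minimal, hypothetical PyTorch illustration of the two ideas named above: scoring local tokens against the instance-level feature and keeping the top-k as "key locals" (a stand-in for KLSF), and reconstructing masked key-local segments of one modality by cross-attending to the other modality (a stand-in for KLR). All names (select_key_locals, KeyLocalReconstructor), the scoring rule, the mask ratio, and the losses are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch only; not the EKLSR code released by the authors.
import torch
import torch.nn as nn
import torch.nn.functional as F


def select_key_locals(local_feats, cls_feat, k):
    """Score local tokens by cosine similarity to the instance-level feature
    and keep the top-k most relevant ones (hypothetical KLSF-style selection).

    local_feats: (B, N, D) patch/word token features from a CLIP-like encoder
    cls_feat:    (B, D)    instance-level ([CLS]/[EOS]) feature
    """
    scores = torch.einsum('bnd,bd->bn',
                          F.normalize(local_feats, dim=-1),
                          F.normalize(cls_feat, dim=-1))            # (B, N)
    topk = scores.topk(k, dim=1).indices                            # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, local_feats.size(-1))   # (B, k, D)
    return local_feats.gather(1, idx)                               # (B, k, D)


class KeyLocalReconstructor(nn.Module):
    """Reconstruct masked key-local segments of one modality by attending to
    the key locals of the other modality (hypothetical KLR-style objective)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, key_locals, other_locals, mask_ratio=0.3):
        B, K, D = key_locals.shape
        num_mask = max(1, int(K * mask_ratio))
        # Randomly pick positions to mask in each sample.
        mask_idx = torch.rand(B, K, device=key_locals.device).argsort(1)[:, :num_mask]
        gather_idx = mask_idx.unsqueeze(-1).expand(-1, -1, D)
        masked = key_locals.clone()
        masked.scatter_(1, gather_idx, self.mask_token.expand(B, num_mask, D))
        # Queries: masked sequence; keys/values: the other modality's key locals.
        recon, _ = self.cross_attn(masked, other_locals, other_locals)
        recon = self.proj(recon)
        # Reconstruction loss only on the masked positions.
        target = key_locals.gather(1, gather_idx)
        pred = recon.gather(1, gather_idx)
        return F.mse_loss(pred, target)


if __name__ == "__main__":
    B, N, D, K = 4, 196, 512, 16
    img_locals, txt_locals = torch.randn(B, N, D), torch.randn(B, 77, D)
    img_cls, txt_cls = torch.randn(B, D), torch.randn(B, D)
    img_key = select_key_locals(img_locals, img_cls, K)
    txt_key = select_key_locals(txt_locals, txt_cls, K)
    loss = KeyLocalReconstructor(D)(img_key, txt_key)
    print(loss.item())

In practice such a reconstruction loss would be added to the usual image-text matching/contrastive objectives rather than trained alone; the snippet only shows the shape of the computation.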



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. key local information selection and reconstruction
  2. remote sensing
  3. specific domain image-text retrieval
  4. text-image person re-identification

Qualifiers

  • Research-article

Funding Sources

  • The National Key Research and Development Program of China under Grant

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
