Abstract
Current text-based person re-identification (re-ID) models tend to learn salient features of images and text. This makes them prone to failure when identifying persons with very similar dress, because images that differ in observable but hard-to-describe ways may share an identical textual description. To address this problem, we propose a re-ID model based on saliency masking that learns non-salient but highly discriminative features, which work together with the salient features to provide more robust pedestrian identification. To further improve performance, we propose a cross-modal projection matching loss with dynamic label smoothing (CMPM-DS), which adaptively adjusts the degree of smoothing applied to the true distribution. Extensive ablation and comparison experiments on two popular re-ID benchmarks demonstrate the effectiveness of our model and loss function: the model achieves state-of-the-art (SOTA) results, improving the best existing R@1 by 0.33% on CUHK-PEDES and 4.45% on RSTPReID.
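To make the idea of smoothing the true distribution concrete, the following is a minimal NumPy sketch of one direction (image to text) of a projection matching loss with a smoothed target. It follows the commonly used formulation of cross-modal projection matching; it is not the authors' implementation. The abstract does not specify how CMPM-DS adapts the smoothing degree, so `alpha_schedule` (a linear decay) and all function and parameter names here are hypothetical placeholders.

```python
import numpy as np

def cmpm_smoothed(img, txt, labels, alpha, eps=1e-8):
    """Image-to-text projection matching loss with a smoothed target.

    img, txt: (n, d) feature arrays; labels: (n,) identity ids;
    alpha: smoothing degree in [0, 1] (how it is chosen is an assumption).
    """
    # Project each image feature onto the L2-normalized text features.
    txt_n = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt_n.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)  # predicted matching distribution

    # True matching distribution: uniform over same-identity pairs in the batch.
    y = (labels[:, None] == labels[None, :]).astype(float)
    q = y / y.sum(axis=1, keepdims=True)

    # Label smoothing: mix the true distribution with a uniform one.
    n = q.shape[1]
    q_s = (1.0 - alpha) * q + alpha / n

    # KL(p || q_s), averaged over the batch.
    kl = (p * (np.log(p + eps) - np.log(q_s + eps))).sum(axis=1)
    return kl.mean()

def alpha_schedule(epoch, total_epochs, alpha_max=0.1):
    """Hypothetical schedule: smoothing decays linearly to zero over training."""
    return alpha_max * (1.0 - epoch / total_epochs)
```

With `alpha = 0` the target is the hard true distribution, and any probability mass placed on a non-matching pair is penalized heavily; a positive `alpha` softens that penalty. A dynamic variant would replace the fixed schedule with whatever adaptive rule the full paper defines.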








Data availability
The data were derived from resources available in the public domain. The dataset links have been uploaded.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 62266009, 62276073), the Guangxi Natural Science Foundation (Nos. 2018GXNSFDA281009, 2019GXNSFDA245018), the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, the Guangxi “Bagui Scholar” Teams for Innovation and Research Project, and the Guangxi First-class Undergraduate Course Construction Project (No. 202103).
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Pang, Y., Zhang, C., Li, Z. et al. Text-based person search by non-saliency enhancing and dynamic label smoothing. Neural Comput & Applic 36, 13327–13339 (2024). https://doi.org/10.1007/s00521-024-09691-1
DOI: https://doi.org/10.1007/s00521-024-09691-1