Skip to main content
Log in

Text-based person search by non-saliency enhancing and dynamic label smoothing

  • Original Article
  • Published:
Neural Computing and Applications Aims and scope Submit manuscript

Abstract

The current text-based person re-identification (re-ID) models tend to learn salient features of image and text, which however is prone to failure in identifying persons with very similar dress, because their image contents with observable but indescribable difference may have identical textual description. To address this problem, we propose a re-ID model based on saliency masking to learn non-salient but highly discriminative features, which can work together with the salient features to provide more robust pedestrian identification. To further improve the performance of the model, a cross-modal projection matching loss with dynamic label smoothing (named CMPM-DS) is proposed to train our model, and our CMPM-DS can adaptively adjust the smoothing degree of the true distribution. We conduct extensive ablation and comparison experiments on two popular re-ID benchmarks to demonstrate the efficiency of our model and loss function, and our model achieves SOTA, improving the existing best R@1 by 0.33% on CUHK-PEDE and 4.45% on RSTPReID.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

Data availability

These data were derived from the following resources available in the public domain. The datasets link have been uploaded.

References

  1. Li S, Xiao T, Li H, Yang W, Wang X (2017) Identity-aware textual-visual matching with latent co-attention. In: Proceedings of the IEEE international conference on computer vision. pp 1890–1899

  2. Wang Z, Zhu A, Zheng Z, Jin J, Xue Z, Hua G (2020) IMG-Net: inner-cross-modal attentional multigranular network for description-based person re-identification. J Electron Imaging 29(4):043028

    Article  Google Scholar 

  3. Zhu A, Wang Z, Li Y, Wan X, Jin J, Wang T, Hu F, Hua G (2021) Dssl: Deep surroundings-person separation learning for text-based person retrieval. In: Proceedings of the 29th ACM international conference on multimedia, pp 209–217

  4. Ding Z, Ding C, Shao Z, Tao D (2021) Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv:2107.12666

  5. Chen Y, Zhang G, Lu Y, Wang Z, Zheng Y (2022) Tipcb: A simple but effective part-based convolutional baseline for text-based person search. Neurocomputing 494:171–181

    Article  Google Scholar 

  6. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth \(16\times 16\) words: transformers for image recognition at scale. arXiv:2010.11929

  7. Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  8. Zhang Y, Lu H (2018) Deep cross-modal projection learning for image-text matching. In: Proceedings of the European conference on computer vision (ECCV). pp 686–701

  9. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 2818–2826

  10. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  11. Chen Y-C, Li L, Yu L, El Kholy A, Ahmed F, Gan Z, Cheng Y, Liu J (2020) Uniter: Universal image-text representation learning. In: Computer vision-ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, pp 104–120

  12. Li G, Duan N, Fang Y, Gong M, Jiang D (2020) Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. Proc AAAI Conf Artif Intell 34:11336–11344

    Google Scholar 

  13. Li LH, Yatskar M, Yin D, Hsieh C-J, Chang K-W (2019) Visualbert: a simple and performant baseline for vision and language. arXiv:1908.03557

  14. Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inf Process Syst 32

  15. Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv:1908.08530

  16. Tan H, Bansal M (2019) Lxmert: Learning cross-modality encoder representations from transformers. arXiv:1908.07490

  17. Chang X, Wang T, Cai S, Sun C (2023) Landmark: language-guided representation enhancement framework for scene graph generation. arXiv:2303.01080

  18. Wu N, Kera H, Kawamoto K (2023) Improving zero-shot action recognition using human instruction with text description. Appl Intell 1–15

  19. Munusamy H (2023) Multimodal attention-based transformer for video captioning. Appl Intell 1–20

  20. Sun C, Myers A, Vondrick C, Murphy K, Schmid C (2019) Videobert: a joint model for video and language representation learning. In: Proceedings of the IEEE/CVF international conference on computer vision. pp 7464–7473

  21. Huang Z, Zeng Z, Huang Y, Liu B, Fu D, Fu J (2021) Seeing out of the box: end-to-end pre-training for vision-language representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 12976–12985

  22. Ning E, Zhang C, Wang C, Ning X, Chen H, Bai X (2023) Pedestrian re-id based on feature consistency and contrast enhancement. Displays 79:102467

    Article  Google Scholar 

  23. Zheng L, Huang Y, Lu H, Yang Y (2019) Pose-invariant embedding for deep person re-identification. IEEE Trans Image Process 28(9):4500–4509

    Article  MathSciNet  Google Scholar 

  24. Yang J, Zhang C, Li Z, Tang Y, Wang Z (2023) Discriminative feature mining with relation regularization for person re-identification. Inf Process Manag 60(3):103295

    Article  Google Scholar 

  25. Wei P, Zhang C, Tang Y, Li Z, Wang Z (2023) Reinforced domain adaptation with attention and adversarial learning for unsupervised person Re-ID. Appl Intell 53(4):4109–4123

    Article  Google Scholar 

  26. Yang J, Zhang C, Tang Y, Li Z (2022) PAFM: pose-drive attention fusion mechanism for occluded person re-identification. Neural Comput Appl 34(10):8241–8252

    Article  Google Scholar 

  27. Li S, Xiao T, Li H, Zhou B, Yue D, Wang X (2017) Person search with natural language description. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 1970–1979

  28. Jing Y, Si C, Wang J, Wang W, Wang L, Tan T (2020) Pose-guided multi-granularity attention network for text-based person search. Proc AAAI Conf Artif Intell 34(07):11189–11196

    Google Scholar 

  29. Niu K, Huang Y, Ouyang W, Wang L (2020) Improving description-based person re-identification by multi-granularity image-text alignments. IEEE Trans Image Process 29:5542–5556

    Article  Google Scholar 

  30. Zheng K, Liu W, Liu J, Zha Z-J, Mei T (2020) Hierarchical Gumbel attention network for text-based person search. In: Proceedings of the 28th ACM international conference on multimedia. pp 3441–3449

  31. Shu X, Wen W, Wu H, Chen K, Song Y, Qiao R, Ren B, Wang X (2022) See finer, see more: implicit modality alignment for text-based person retrieval. arXiv:2208.08608

  32. Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2018) Autoaugment: learning augmentation policies from data. arXiv:1805.09501

  33. Lim S, Kim I, Kim T, Kim C, Kim S (2019) Fast autoaugment. Adv Neural Inf Process Syst 32

  34. Ho D, Liang E, Chen X, Stoica I, Abbeel P (2019) Population based augmentation: efficient learning of augmentation policy schedules. In: International conference on machine learning, PMLR. pp 2731–2741

  35. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

    MathSciNet  Google Scholar 

  36. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. PMLR, pp 448–456

  37. Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist 2:67–78

    Article  Google Scholar 

  38. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755

  39. He S, Luo H, Wang P, Wang F, Li H, Jiang W (2021) Transreid: Transformer-based object re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV). pp 15013–15022

  40. Reed S, Akata Z, Lee H, Schiele B (2016) Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 49–58

  41. Chen T, Xu C, Luo J (2018) Improving text-based person search by spatial matching and adaptive threshold. In: 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 1879–1887

  42. Chen D, Li H, Liu X, Shen Y, Shao J, Yuan Z, Wang X (2018) Improving deep visual representation for person re-identification by global and local image-language association. In: Proceedings of the European conference on computer vision (ECCV). pp 54–70

  43. Liu J, Zha Z-J, Hong R, Wang M, Zhang Y (2019) Deep adversarial graph attention convolution network for text-based person search. In: Proceedings of the 27th ACM international conference on multimedia, pp 665–673

  44. Aggarwal S, Radhakrishnan VB, Chakraborty A (2020) Text-based person search via attribute-aided matching. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2617–2625

  45. Gao C, Cai G, Jiang X, Zheng F, Zhang J, Gong Y, Peng P, Guo X, Sun X (2021) Contextual non-local alignment over full-scale representation for text-based person search. arXiv:2101.03036

  46. Wang C, Luo Z, Lin Y, Li S (2021) Text-based person search via multi-granularity embedding learning. In: IJCAI, pp 1068–1074

  47. Han X, He S, Zhang L, Xiang T (2021) Text-based person search with limited data. arXiv:2110.10807

  48. Wang Z, Zhu A, Xue J, Wan X, Liu C, Wang T, Li Y (2022) Look before you leap: improving text-based person retrieval by learning a consistent cross-modal common manifold. In: Proceedings of the 30th ACM international conference on multimedia, pp 1984–1992

  49. Li F, Zhou H, Li H, Zhang Y, Yu Z (2022) Person text-image matching via text-feature interpretability embedding and external attack node implantation. arXiv:2211.08657

  50. Li S, Cao M, Zhang M (2022) Learning semantic-aligned feature representation for text-based person search. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2724–2728

  51. Wang Z, Zhu A, Xue J, Wan X, Liu C, Wang T, Li Y (2022) Caibc: Capturing all-round information beyond color for text-based person retrieval. In: Proceedings of the 30th ACM international conference on multimedia, pp 5314–5322

  52. Shao Z, Zhang X, Fang M, Lin Z, Wang J, Ding C (2022) Learning granularity-unified representations for text-to-image person re-identification. In: Proceedings of the 30th ACM international conference on multimedia, pp 5566–5574

  53. Wang Z, Xue J, Zhu A, Li Y, Zhang M, Zhong C (2021) Amen: adversarial multi-space embedding network for text-based person re-identification. In: Chinese conference on pattern recognition and computer vision (PRCV). Springer, pp 462–473

Download references

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 62266009, 62276073), Guangxi Natural Science Foundation (No. 2018GXNSFDA281009, 2019GXNSFDA245018), Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, Guangxi “Bagui Scholar” Teams for Innovation and Research Project, and Guangxi First-class Undergraduate Course Construction Project (No. 202103).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Canlong Zhang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interest to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Pang, Y., Zhang, C., Li, Z. et al. Text-based person search by non-saliency enhancing and dynamic label smoothing. Neural Comput & Applic 36, 13327–13339 (2024). https://doi.org/10.1007/s00521-024-09691-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-024-09691-1

Keywords