Exploring granularity-associated invariance features for text-to-image person re-identification

  • Regular Paper
  • Multimedia Systems

Abstract

Text-to-image person re-identification (TIReID) aims to retrieve pedestrian images that match a given textual description query. The main challenge of the task is bridging the significant gap between the text and image modalities. Previous works primarily use cross-modality matching constraints to align the global or local features of samples. However, these methods overlook the relationship inconsistency caused by differing text descriptions, and they introduce redundant information during local feature extraction. In this paper, we propose the Granularity-Associated Invariance Features (GAIF) learning strategy to explore potential cross-modality invariant information. First, we propose Global Matching Relationship Improvement (GMRI), which uses dynamic constraint factors to regulate the matching relationships between different samples. Second, we construct the Local Joint Learning Strategy (LJLS) to iteratively optimize fine-grained information from representation learning or metric learning views. Furthermore, we integrate GMRI and LJLS into a unified framework and apply various constraints to jointly optimize global and local associated invariant features. We conduct extensive experiments to assess the proposed GAIF on three TIReID benchmark databases. The experimental results demonstrate that GAIF outperforms most advanced methods on key evaluation criteria.
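
To make the retrieval setting concrete, the following is a minimal sketch of generic global-feature matching, assuming that text and image features have already been embedded into a shared space. It is not the GAIF, GMRI, or LJLS method described above. The function names (`rank_gallery`, `rank_k_accuracy`) and the toy vectors are hypothetical, chosen only for illustration, and the evaluation shown is a plain Rank-k hit rate.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(text_feature, image_features):
    """Return gallery indices sorted from most to least similar to the text query."""
    scores = [cosine_similarity(text_feature, img) for img in image_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def rank_k_accuracy(text_features, text_ids, image_features, image_ids, k=1):
    """Fraction of text queries whose identity appears among the top-k ranked images."""
    hits = 0
    for feat, pid in zip(text_features, text_ids):
        top = rank_gallery(feat, image_features)[:k]
        if any(image_ids[i] == pid for i in top):
            hits += 1
    return hits / len(text_features)


# Toy data (hypothetical): three gallery images and two text queries in a 2-D space.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
gallery_ids = [0, 1, 2]
queries = [[0.9, 0.1], [0.1, 0.9]]
query_ids = [0, 1]

print(rank_gallery(queries[0], gallery))
print(rank_k_accuracy(queries, query_ids, gallery, gallery_ids, k=1))
```

In this sketch each text query is scored against every gallery image, and a query counts as a hit if an image with the same identity appears in its top-k results. Methods that align local features would replace the single global vector per sample with several part-level or token-level vectors.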

Data availability

No datasets were generated or analysed during the current study.

Acknowledgements

This work is supported by the Shandong Provincial Natural Science Foundation under Grant No. ZR2023LZH013 and No. ZR2024QF185, the Jinan Municipal and School Integration Development Strategy Project under Grant No. JNSX2023025 and No. JNSX2023015, and the New Introduced Talents Program of University of Jinan under Grant No. 1009569. The numerical calculations are supported by the High-performance Computing Platform at University of Jinan.

Author information

Contributions

C. Shao: Methodology, Software, Investigation, Validation, Writing - original draft. T. Si: Conceptualization, Methodology, Writing - original draft, Writing - review, Project administration. X. Yang: Writing - review, Validation, Supervision, Project administration, Funding acquisition.

Corresponding authors

Correspondence to Tongzhen Si or Xiaohui Yang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by Junyu Gao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shao, C., Si, T. & Yang, X. Exploring granularity-associated invariance features for text-to-image person re-identification. Multimedia Systems 31, 51 (2025). https://doi.org/10.1007/s00530-024-01638-9

  • DOI: https://doi.org/10.1007/s00530-024-01638-9