Abstract
Text-to-image person re-identification (TIReID) aims to identify and locate pedestrian images from given textual description queries. The main challenge of the task is bridging the significant gap between the text and image modalities. Previous works primarily use cross-modality matching constraints to align global or local features between samples. However, these methods overlook the relationship inconsistency caused by differing text descriptions, and their local feature extraction produces redundant local information. In this paper, we propose the Granularity-Associated Invariance Features (GAIF) learning strategy to explore potential cross-modality invariant information. First, we propose Global Matching Relationship Improvement (GMRI), which uses dynamic constraint factors to regulate the matching relationships between different samples. Second, we construct the Local Joint Learning Strategy (LJLS) to iteratively optimize fine-grained information from representation learning or metric learning views. Finally, we integrate GMRI and LJLS into a unified framework and apply multiple constraints to jointly optimize the global and local associated invariant features. We conduct extensive experiments to assess GAIF on three TIReID benchmark datasets. The results show that GAIF outperforms most state-of-the-art methods on the key evaluation criteria.
Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work is supported by the Shandong Provincial Natural Science Foundation under Grant Nos. ZR2023LZH013 and ZR2024QF185, the Jinan Municipal and School Integration Development Strategy Project under Grant Nos. JNSX2023025 and JNSX2023015, and the New Introduced Talents Program of University of Jinan under Grant No. 1009569. The numerical calculations are supported by the High-performance Computing Platform at University of Jinan.
Author information
Contributions
C. Shao: Methodology, Software, Investigation, Validation, Writing - original draft. T. Si: Conceptualization, Methodology, Writing - original draft, Writing - review, Project administration. X. Yang: Writing - review, Validation, Supervision, Project administration, Funding acquisition.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by Junyu Gao.
About this article
Cite this article
Shao, C., Si, T. & Yang, X. Exploring granularity-associated invariance features for text-to-image person re-identification. Multimedia Systems 31, 51 (2025). https://doi.org/10.1007/s00530-024-01638-9
DOI: https://doi.org/10.1007/s00530-024-01638-9