Exploring granularity-associated invariance features for text-to-image person re-identification

Shao, Chenglong; Si, Tongzhen; Yang, Xiaohui

doi:10.1007/s00530-024-01638-9

Regular Paper
Published: 07 January 2025

Multimedia Systems, Volume 31, article number 51 (2025)

Chenglong Shao^1, Tongzhen Si^1,2 & Xiaohui Yang^1,2,3

Abstract

Text-to-image person re-identification (TIReID) aims to identify and locate pedestrian images based on given textual description queries. The main challenge of the task is bridging the significant gap between text and image modalities. Previous works primarily utilize cross-modality matching constraints to align the global or local features between samples. However, these methods overlook the relationship inconsistency problem caused by different text descriptions and generate local information redundancy in the local feature extraction process. In this paper, we propose the Granularity-Associated Invariance Features (GAIF) learning strategy to explore potential cross-modality invariant information. Firstly, we propose Global Matching Relationship Improvement (GMRI) with dynamic constraint factors to regulate the matching relationships between different samples. Secondly, we construct the Local Joint Learning Strategy (LJLS) to iteratively optimize fine-grained information from representation learning or metric learning views. Furthermore, we integrate GMRI and LJLS into a unified framework and utilize various constraints to comprehensively optimize global and local associated invariant features. We conduct extensive experiments to assess the proposed GAIF on three TIReID benchmark databases. The experimental results demonstrate that the proposed GAIF outperforms most of the advanced methods in key criteria.
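The retrieval setting described in the abstract can be illustrated with a minimal sketch: a text query embedding is compared against gallery image embeddings, and the gallery is ranked by similarity. This is not the GAIF method itself, whose details are not given here. The use of cosine similarity, the toy embeddings, the identity labels, and the function names `cosine`, `rank_gallery`, and `rank_k_hit` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_gallery(text_feat, image_feats):
    # Return gallery indices sorted from most to least similar to the text query.
    scores = [cosine(text_feat, f) for f in image_feats]
    return sorted(range(len(image_feats)), key=lambda i: scores[i], reverse=True)

def rank_k_hit(ranking, gallery_ids, query_id, k):
    # True if an image with the queried identity appears in the top-k results.
    return any(gallery_ids[i] == query_id for i in ranking[:k])

# Hypothetical toy data: one text embedding and three image embeddings.
text_query = [0.9, 0.1, 0.0]
gallery = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0], [0.1, 0.0, 1.0]]
gallery_ids = [7, 3, 5]  # made-up person identity labels

ranking = rank_gallery(text_query, gallery)
print(ranking)                                # most similar gallery index first
print(rank_k_hit(ranking, gallery_ids, 3, 1))  # is identity 3 retrieved at rank 1?
```

Averaging a top-k hit check of this kind over all queries yields a Rank-k accuracy, a common retrieval criterion for this task; which criteria the paper reports is not stated in this preview.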

Data availability

No datasets were generated or analysed during the current study.

Acknowledgements

This work is supported by the Shandong Provincial Natural Science Foundation under Grant Nos. ZR2023LZH013 and ZR2024QF185, the Jinan Municipal and School Integration Development Strategy Project under Grant Nos. JNSX2023025 and JNSX2023015, and the New Introduced Talents Program of University of Jinan under Grant No. 1009569. The numerical calculations are supported by the High-performance Computing Platform at University of Jinan.

Author information

Authors and Affiliations

School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
Chenglong Shao, Tongzhen Si & Xiaohui Yang
Shandong Key Laboratory of Ubiquitous Intelligent Computing, University of Jinan, Jinan, 250022, China
Tongzhen Si & Xiaohui Yang
Jinan Inspur Data Technology Co. Ltd., Jinan, 250101, China
Xiaohui Yang

Contributions

C. Shao: Methodology, Software, Investigation, Validation, Writing - original draft. T. Si: Conceptualization, Methodology, Writing - original draft, Writing - review, Project administration. X. Yang: Writing - review, Validation, Supervision, Project administration, Funding acquisition.

Corresponding authors

Correspondence to Tongzhen Si or Xiaohui Yang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by Junyu Gao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shao, C., Si, T. & Yang, X. Exploring granularity-associated invariance features for text-to-image person re-identification. Multimedia Systems 31, 51 (2025). https://doi.org/10.1007/s00530-024-01638-9

Received: 21 July 2024
Accepted: 20 December 2024
Published: 07 January 2025
DOI: https://doi.org/10.1007/s00530-024-01638-9
