Abstract
Generalizable person re-identification (ReID) has gained significant attention in recent years, as it poses the greater challenge of recognizing individuals across different domains and in unseen scenarios. Existing methods are typically limited to a single visual modality, which makes it difficult to capture rich semantic information across domains. Recently, pre-trained vision-language models such as CLIP have shown promising performance in various tasks by linking visual representations with their corresponding text descriptions. This enables them to capture diverse high-level semantics from the accompanying text and obtain transferable features. However, the adoption of CLIP in person ReID has been hindered because the labels are typically index-based rather than descriptive texts. To address this limitation, we propose a novel Cross-modal framework wIth Conditional Prompt (CICP) based on CLIP. It involves the Description Prompt Module (DPM), which pre-trains a set of prompts to tackle the lack of textual information in person ReID. In addition, we propose the Prompt Generalization Module (PGM), which incorporates a lightweight network that generates a conditional token for each image. This module shifts the focus from being limited to a class set to being specific to each input instance, thereby enhancing the domain generalization capability for the entire task. Through extensive experiments, we show that our proposed method outperforms state-of-the-art (SOTA) approaches on popular benchmark datasets.
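The instance-conditional idea described for the PGM can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the two-layer bottleneck network, the additive combination of the token with the shared prompts, the dimensions, and all function and variable names are hypothetical and written in the general spirit of conditional prompt learning, not taken from the authors' implementation.

```python
import numpy as np

def init_meta_net(feat_dim, token_dim, hidden_dim, rng):
    """Hypothetical lightweight network: a two-layer bottleneck MLP.

    Weights are randomly initialized here; in practice they would be trained.
    """
    return {
        "w1": rng.normal(0.0, 0.02, size=(feat_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, size=(hidden_dim, token_dim)),
        "b2": np.zeros(token_dim),
    }

def conditional_token(meta_net, image_features):
    """Map image features (B, feat_dim) to one token per image (B, token_dim)."""
    hidden = np.maximum(image_features @ meta_net["w1"] + meta_net["b1"], 0.0)  # ReLU
    return hidden @ meta_net["w2"] + meta_net["b2"]

def instance_prompts(shared_prompts, tokens):
    """Shift the shared prompt vectors (M, token_dim) by each image's token.

    Assumption: the token is added to every prompt vector.
    Returns an array of shape (B, M, token_dim), one prompt set per image.
    """
    return shared_prompts[None, :, :] + tokens[:, None, :]

rng = np.random.default_rng(0)
feat_dim, token_dim, hidden_dim = 512, 512, 32  # assumed sizes
num_prompts, batch = 4, 8                       # assumed sizes

meta_net = init_meta_net(feat_dim, token_dim, hidden_dim, rng)
shared_prompts = rng.normal(0.0, 0.02, size=(num_prompts, token_dim))
image_features = rng.normal(size=(batch, feat_dim))  # stand-in for image encoder output

tokens = conditional_token(meta_net, image_features)
prompts = instance_prompts(shared_prompts, tokens)
print(prompts.shape)  # (8, 4, 512)
```

Under these assumptions, each image receives its own prompt set rather than one set shared by every identity class, which is the shift from class-level to instance-level prompts that the abstract attributes to the PGM.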
This work was supported in part by the National Natural Science Foundation of China under Grants 82172033, U19B2031, 61971369, 52105126, 82272071, and 62271430, and by the Fundamental Research Funds for the Central Universities under Grant 20720230104.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Cheng, L., Kuang, Z., Zhang, H., Ding, X., Huang, Y. (2024). Boosting Generalization Performance in Person Re-identification. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14434. Springer, Singapore. https://doi.org/10.1007/978-981-99-8549-4_15
DOI: https://doi.org/10.1007/978-981-99-8549-4_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8548-7
Online ISBN: 978-981-99-8549-4
eBook Packages: Computer Science (R0)