Skip to main content

Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset. The code and datasets are available at https://github.com/FHR-L/VSLA-CLIP.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Aich, A., Zheng, M., Karanam, S., Chen, T., Roy-Chowdhury, A.K., Wu, Z.: Spatio-temporal representation factorization for video-based person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 152–162 (2021)

    Google Scholar 

  2. Bai, S., Ma, B., Chang, H., Huang, R., Chen, X.: Salient-to-broad transition for video person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7339–7348 (2022)

    Google Scholar 

  3. Baltieri, D., Vezzani, R., Cucchiara, R.: 3dpes: 3d people dataset for surveillance and forensics. In: Joint Acm Workshop on Human Gesture & Behavior Understanding (2011)

    Google Scholar 

  4. Chao, H., He, Y., Zhang, J., Feng, J.: Gaitset: regarding gait as a set for cross-view gait recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8126–8133 (2019)

    Google Scholar 

  5. Chen, D., Li, H., Xiao, T., Yi, S., Wang, X.: Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1169–1178 (2018)

    Google Scholar 

  6. Chen, G., Rao, Y., Lu, J., Zhou, J.: Temporal coherence or temporal motion: Which is more critical for video-based person re-identification? In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pp. 660–676. Springer (2020)

    Google Scholar 

  7. Cheng, D., He, L., Wang, N., Zhang, S., Wang, Z., Gao, X.: Efficient bilateral cross-modality cluster matching for unsupervised visible-infrared person reid. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 1325–1333 (2023)

    Google Scholar 

  8. Cheng, D., et al.: Continual all-in-one adverse weather removal with knowledge replay on a unified network structure. IEEE Trans. Multimed. (2024)

    Google Scholar 

  9. Cheng, D., Zhou, J., Wang, N., Gao, X.: Hybrid dynamic contrast and probability distillation for unsupervised person re-id. IEEE Trans. Image Process. 31, 3334–3346 (2022). https://doi.org/10.1109/TIP.2022.3169693

    Article  Google Scholar 

  10. Chung, D., Tahboub, K., Delp, E.J.: A two stream siamese convolutional neural network for person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1983–1991 (2017)

    Google Scholar 

  11. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  12. Eom, C., Lee, G., Lee, J., Ham, B.: Video-based person re-identification with spatial and temporal memory networks. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12016–12025 (2021). https://doi.org/10.1109/ICCV48922.2021.01182

  13. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)

    Google Scholar 

  14. Fu, Y., Wang, X., Wei, Y., Huang, T.: Sta: spatial-temporal attention for large-scale video-based person re-identification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8287–8294 (2019)

    Google Scholar 

  15. Gu, X., Chang, H., Ma, B., Zhang, H., Chen, X.: Appearance-preserving 3d convolution for video-based person re-identification. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 228–243. Springer (2020)

    Google Scholar 

  16. He, S., Luo, H., Wang, P., Wang, F., Li, H., Jiang, W.: Transreid: transformer-based object re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15013–15022 (October 2021)

    Google Scholar 

  17. He, T., Jin, X., Shen, X., Huang, J., Chen, Z., Hua, X.S.: Dense interaction learning for video-based person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1490–1501 (2021)

    Google Scholar 

  18. Hirzer, M., Beleznai, C., Roth, P.M., Bischof, H.: Person re-identification by descriptive and discriminative classification. In: Image Analysis: 17th Scandinavian Conference, SCIA 2011, Ystad, Sweden, May 2011. Proceedings 17, pp. 91–102. Springer (2011)

    Google Scholar 

  19. Hou, R., Ma, B., Chang, H., Gu, X., Shan, S., Chen, X.: Vrstc: occlusion-free video person re-identification. In: CVPR, pp. 7183–7192 (2019)

    Google Scholar 

  20. Hou, R., Ma, B., Chang, H., Gu, X., Shan, S., Chen, X.: Temporal complementary learning for video person re-identification. In: ECCV, pp. 388–405 (2020)

    Google Scholar 

  21. Hou, R., Chang, H., Ma, B., Huang, R., Shan, S.: Bicnet-tks: learning efficient spatial-temporal representation for video person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2014–2023, June 2021

    Google Scholar 

  22. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=nZeVKeeFYf9

  23. Jia, M., Tang, L., Chen, B., Cardie, C., Belongie, S.J., Hariharan, B., Lim, S.: Visual prompt tuning. In: ECCV (33). LNCS, vol. 13693, pp. 709–727. Springer (2022)

    Google Scholar 

  24. Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

  25. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  26. Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)

  27. Li, H., et al.: Boosting low-data instance segmentation by unsupervised pre-training with saliency prompt. arXiv preprint arXiv:2302.01171 (2023)

  28. Li, J., Wang, J., Tian, Q., Gao, W., Zhang, S.: Global-local temporal representations for video person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3958–3967 (2019)

    Google Scholar 

  29. Li, J., Zhang, S., Huang, T.: Multiscale 3d convolution network for video based person reidentification. In: AAAI, pp. 8618–8625 (2019)

    Google Scholar 

  30. Li, S., Sun, L., Li, Q.: Clip-reid: exploiting vision-language model for image re-identification without concrete text labels. arXiv preprint arXiv:2211.13977 (2022)

  31. Liu, H., Jie, Z., Jayashree, K., Qi, M., Jiang, J., Yan, S., Feng, J.: Video-based person re-identification with accumulative motion context. IEEE Trans. Circuits Syst. Video Technol. 28(10), 2788–2802 (2017)

    Article  Google Scholar 

  32. Liu, X., Zhang, P., Lu, H.: Video-based person re-identification with long short-term representation learning. In: International Conference on Image and Graphics, pp. 55–67. Springer (2023)

    Google Scholar 

  33. Liu, X., Zhang, P., Yu, C., Lu, H., Yang, X.: Watching you: Global-guided reciprocal learning for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13334–13343 (2021)

    Google Scholar 

  34. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

    Google Scholar 

  35. Rasheed, H., khattak, M.U., Maaz, M., Khan, S., Khan, F.S.: Finetuned clip models are efficient video learners. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

    Google Scholar 

  36. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)

    Google Scholar 

  37. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

    Google Scholar 

  38. Vaswani, A., et al.: Attention is all you need. Advances in neural information processing systems 30 (2017)

    Google Scholar 

  39. Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision, pp. 20–36. Springer (2016)

    Google Scholar 

  40. Wang, X., Zhao, R.: Person re-identification: System design and evaluation overview. In: Person Re-Identification, pp. 351–370. Springer (2014)

    Google Scholar 

  41. Wang, Y., Zhang, P., Gao, S., Geng, X., Lu, H., Wang, D.: Pyramid spatial-temporal aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12026–12035 (2021)

    Google Scholar 

  42. Xing, Y., Wu, Q., Cheng, D., Zhang, S., Liang, G., Wang, P., Zhang, Y.: Dual modality prompt tuning for vision-language pre-trained model. IEEE Trans. Multimedia 26, 2056–2068 (2024). https://doi.org/10.1109/TMM.2023.3291588

    Article  Google Scholar 

  43. Yan, Y., et al.: Learning multi-granular hypergraphs for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2899–2908 (2020)

    Google Scholar 

  44. Yang, J., Zheng, W.S., Yang, Q., Chen, Y.C., Tian, Q.: Spatial-temporal graph convolutional network for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3289–3299 (2020)

    Google Scholar 

  45. Yin, J., Wu, A., Zheng, W.S.: Fine-grained person re-identification. Int. J. Comput. Vision 128(6), 1654–1672 (2020). https://doi.org/10.1007/s11263-019-01259-0

    Article  Google Scholar 

  46. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. Advances in neural information processing systems 30 (2017)

    Google Scholar 

  47. Zang, X., Li, G., Gao, W.: Multidirection and multiscale pyramid in transformer for video-based pedestrian retrieval. IEEE Trans. Industr. Inf. 18(12), 8776–8785 (2022). https://doi.org/10.1109/TII.2022.3151766

    Article  Google Scholar 

  48. Zhang, S., Yang, Y., Wang, P., Liang, G., Zhang, X., Zhang, Y.: Attend to the difference: Cross-modality person re-identification via contrastive correlation. IEEE Trans. Image Process. 30, 8861–8872 (2021). https://doi.org/10.1109/TIP.2021.3120881

    Article  Google Scholar 

  49. Zhang, S., et al.: Person re-identification in aerial imagery. IEEE Trans. Multimedia 23, 281–291 (2021). https://doi.org/10.1109/TMM.2020.2977528

    Article  Google Scholar 

  50. Zhang, Z., Lan, C., Zeng, W., Chen, Z.: Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10407–10416 (2020)

    Google Scholar 

  51. Zheng, L., et al.: Mars: a video benchmark for large-scale person re-identification. In: ECCV, pp. 868–884 (2016)

    Google Scholar 

  52. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13001–13008 (2020)

    Google Scholar 

  53. Zhou, Z., Huang, Y., Wang, W., Wang, L., Tan, T.: See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In: Proceedings of the IEEE CDonference on Computer Vision and Pattern Recognition, pp. 4747–4756 (2017)

    Google Scholar 

  54. Zhu, K., Guo, H., Zhang, S., Wang, Y., Liu, J., Wang, J., Tang, M.: Aaformer: auto-aligned transformer for person re-identification. IEEE Trans. Neural Networks Learn. Syst. (2023)

    Google Scholar 

Download references

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62101453, 62176198 and 62201467, the Key Research and Development Program of Shaanxi Province under Grant 2024GX-YBXM-135, in part by China Postdoctoral Science Foundation under Grant 2022TQ0260, 2023M742842, in part by the Young Talent Fund of Xi’an Association for Science and Technology under Grant 959202313088, Innovation Capability Support Program of Shaanxi (No. 2024ZC-KJXX-043), the Fundamental Research Funds for the Central Universities No. HYGJZN202331 and the Natural Science Basic Research Program of Shaanxi Province (No. 2022JC-DW-08).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to De Cheng .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 97906 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhang, S. et al. (2025). Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15085. Springer, Cham. https://doi.org/10.1007/978-3-031-73383-3_16

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73383-3_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73382-6

  • Online ISBN: 978-3-031-73383-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics