
Spatial-temporal graph-guided global attention network for video-based person re-identification

  • Research
  • Published in Machine Vision and Applications

Abstract

Global attention learning has been widely applied to video-based person re-identification because of its strength in capturing contextual correlations. However, existing global attention learning methods usually adopt conventional neural networks to model non-Euclidean contextual correlations, which limits their representation ability. Inspired by the graph-structured nature of these correlations, we propose a spatial-temporal graph-guided global attention network (STG\(^3\)A) for video-based person re-identification. STG\(^3\)A comprises two graph-guided attention modules that capture the spatial contexts within a frame and the temporal contexts across all frames of a sequence for global attention learning. The graphs from both modules are further encoded as graph representations, which are combined with the attention-weighted representations to adequately capture spatial-temporal contextual information for video feature learning. To reduce the effect of noisy graph nodes and learn robust graph representations, a graph node attention is developed to trade off the importance of each graph node, yielding noise-tolerant graph models. Finally, we design a graph-guided fusion scheme that integrates the representations produced by the two attentive modules into a more compact video feature. Extensive experiments on the MARS and DukeMTMC-VideoReID datasets demonstrate the superior performance of STG\(^3\)A.
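To make the pipeline described above concrete, the sketch below shows one plausible form of a graph-guided attention module with graph node attention, written in PyTorch. It is a minimal illustration under stated assumptions, not the authors' STG\(^3\)A implementation: the class `GraphGuidedAttention`, its layers, and all dimensions are hypothetical, and the paper's temporal module and graph-guided fusion scheme are not reproduced here.

```python
# Hypothetical sketch of a graph-guided (spatial) attention module with graph
# node attention. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphGuidedAttention(nn.Module):
    """Builds a graph over feature-map positions, propagates context with one
    graph-convolution step, and emits (i) attention-weighted features and
    (ii) a pooled graph representation reweighted by graph node attention."""

    def __init__(self, channels: int, embed_dim: int = 256):
        super().__init__()
        self.embed = nn.Conv2d(channels, embed_dim, kernel_size=1)  # node features
        self.gcn = nn.Linear(embed_dim, embed_dim)                  # graph-conv weight
        self.attn_head = nn.Linear(embed_dim, 1)                    # per-position attention logit
        self.node_attn = nn.Linear(embed_dim, 1)                    # graph node attention logit

    def forward(self, x: torch.Tensor):
        # x: (B, C, H, W) frame features; graph nodes are the H*W positions.
        b, _, h, w = x.shape
        nodes = self.embed(x).flatten(2).transpose(1, 2)            # (B, N, D), N = H*W
        # Adjacency from pairwise affinities, row-normalized with softmax.
        adj = torch.softmax(nodes @ nodes.transpose(1, 2), dim=-1)  # (B, N, N)
        ctx = F.relu(self.gcn(adj @ nodes))                         # one propagation step
        # Global attention map over spatial positions, applied to the input.
        attn = torch.sigmoid(self.attn_head(ctx)).view(b, 1, h, w)  # (B, 1, H, W)
        weighted = x * attn                                         # weighted representation
        # Graph node attention: down-weight noisy nodes before pooling.
        node_w = torch.softmax(self.node_attn(ctx), dim=1)          # (B, N, 1)
        graph_repr = (node_w * ctx).sum(dim=1)                      # (B, D) graph representation
        return weighted, graph_repr


# Example: spatial module on a batch of 8 frames with 2048-channel maps.
module = GraphGuidedAttention(channels=2048)
feats, graph_repr = module(torch.randn(8, 2048, 16, 8))
```

Under the same assumptions, a temporal counterpart would treat per-frame descriptors as the graph nodes across a sequence, and the fusion scheme would merge the weighted and graph representations from both modules into the final video feature.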



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62276019 and 62006017.

Author information


Contributions

The proposed method was designed by XL, QL and WW. Material preparation, data collection and experimentation were performed by XL, and the first draft of the manuscript was written by XL. The analysis of experimental results was performed by XL, QL, WW and JZ. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaobao Li.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, X., Wang, W., Li, Q. et al. Spatial-temporal graph-guided global attention network for video-based person re-identification. Machine Vision and Applications 35, 8 (2024). https://doi.org/10.1007/s00138-023-01489-w

