
Local and global aligned spatiotemporal attention network for video-based person re-identification

Multimedia Tools and Applications

Abstract

Matching video clips of people across non-overlapping surveillance cameras (video-based person re-identification) is of considerable importance in many real-world applications. In this paper, we address video-based person re-identification by developing a Local and Global Aligned Spatiotemporal Attention (LGASA) network. The LGASA network consists of five cascaded modules: 3D convolutional layers, a residual block, a spatial transformer network (STN), a multi-stream recurrent network, and a multiple-attention module. Specifically, the 3D convolutional layers capture local, short-term, fast-varying motion information encoded in multiple adjacent raw frames. The residual block extracts mid-level feature maps, which the STN then aligns. The multi-stream recurrent network exploits useful local and global long-term temporal dependencies in the aligned mid-level feature maps. Finally, the multiple-attention module aggregates the feature vectors of the same body part (or the global feature vectors) from different frames of each video into a single vector, weighting each frame according to its importance. Experimental results on three video-based pedestrian datasets verify the effectiveness of the proposed LGASA network.
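The abstract describes the multiple-attention module only at a high level. As a rough illustration of importance-weighted temporal aggregation, the sketch below assumes that each frame's score is produced by a single linear layer (weights `w`, bias `b`) and normalized with a softmax over the T frames, so the pooled vector is v = Σ_t a_t f_t with a_t = exp(s_t) / Σ_j exp(s_j). This scoring function, the helper names `attention_pool` and `aggregate_streams`, and the stream names are hypothetical and are not taken from the paper. Plain Python is used so the example runs without PyTorch.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_feats: Sequence[Sequence[float]],
                   w: Sequence[float], b: float = 0.0) -> List[float]:
    """Aggregate T per-frame vectors (each of length D) into one D-vector.

    Hypothetical scoring: s_t = w . f_t + b (one linear layer).
    The weights a_t = softmax(s)_t sum to 1; the output is sum_t a_t * f_t.
    """
    scores = [sum(wi * fi for wi, fi in zip(w, f)) + b for f in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    return [sum(a * f[d] for a, f in zip(weights, frame_feats))
            for d in range(dim)]


def aggregate_streams(
    streams: Dict[str, Sequence[Sequence[float]]],
    params: Dict[str, Tuple[Sequence[float], float]],
) -> Dict[str, List[float]]:
    """Pool each stream (e.g. one per body part, plus a global stream)
    with its own hypothetical (w, b) scoring parameters."""
    return {name: attention_pool(feats, *params[name])
            for name, feats in streams.items()}


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Zero weights give equal scores, i.e. plain temporal averaging.
    print(attention_pool(frames, [0.0, 0.0]))
    # Larger weight on dimension 0 favors frames with a high first feature.
    print(attention_pool(frames, [10.0, 0.0]))
```

Keeping separate scoring parameters per stream lets each body-part stream and the global stream weight the same frames differently, which is one plausible reading of aggregating "the same body part" across frames; with all-zero weights the pool reduces to simple temporal averaging.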




Notes

  1. ReduceLROnPlateau is a learning-rate scheduler provided by PyTorch; see https://pytorch.org/docs/stable/optim.html
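The note does not give the scheduler settings used in the paper. To show what ReduceLROnPlateau does, the sketch below re-implements its core rule for `mode='min'` in plain Python: the learning rate is multiplied by `factor` once the monitored metric has failed to improve for more than `patience` consecutive epochs. The class name `PlateauScheduler` is hypothetical, and it deliberately omits PyTorch's `threshold` and `cooldown` options, so it is a simplified illustration rather than the library's implementation.

```python
class PlateauScheduler:
    """Simplified, hypothetical stand-in for ReduceLROnPlateau (mode='min').

    Omits PyTorch's threshold and cooldown options; keeps factor,
    patience and min_lr.
    """

    def __init__(self, lr: float, factor: float = 0.1,
                 patience: int = 10, min_lr: float = 0.0) -> None:
        if not 0.0 < factor < 1.0:
            raise ValueError("factor must be in (0, 1)")
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.num_bad_epochs = 0

    def step(self, metric: float) -> float:
        """Record one epoch's validation metric (lower is better)
        and return the learning rate to use next."""
        if metric < self.best:
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.num_bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.num_bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    sched = PlateauScheduler(lr=1e-3, factor=0.5, patience=2)
    for val_loss in [1.0, 0.9, 0.95, 0.95, 0.95, 0.8]:
        print(val_loss, sched.step(val_loss))
```

With PyTorch installed, the corresponding setup is `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)`, with `scheduler.step(val_loss)` called once per epoch after validation.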


Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions. This work was supported by the NSFC-Key Project under Grant No. 61933013, the NSFC-Key Project of General Technology Fundamental Research United Fund under Grant No. U1736211, the Key Project of Natural Science Foundation of Hubei Province under Grant No. 2018CFA024, the Natural Science Foundation of Guangdong Province under Grant No. 2019A1515011076, the National Key Research and Development Program of China under Grant No. 2017YFB0202001, the National Natural Science Foundation of China under Grant No. 61672208, the Higher Education Institution Key Research Projects of Henan Province under Grant No. 19A520001, and the Key Scientific and Technological Project of Henan Province under Grant No. 192102210277.

Author information


Corresponding author

Correspondence to Li Cheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest regarding the publication of this paper.



About this article


Cite this article

Cheng, L., Jing, XY., Zhu, X. et al. Local and global aligned spatiotemporal attention network for video-based person re-identification. Multimed Tools Appl 79, 34489–34512 (2020). https://doi.org/10.1007/s11042-020-08765-1


  • DOI: https://doi.org/10.1007/s11042-020-08765-1
