Local and global aligned spatiotemporal attention network for video-based person re-identification

Cheng, Li; Jing, Xiao-Yuan; Zhu, Xiaoke; Hu, Chang-Hui; Gao, Guangwei; Wu, Songsong

doi:10.1007/s11042-020-08765-1

Published: 06 March 2020

Volume 79, pages 34489–34512, (2020)

Multimedia Tools and Applications

Li Cheng¹, Xiao-Yuan Jing¹,²,³, Xiaoke Zhu¹,⁴, Chang-Hui Hu³, Guangwei Gao³ & Songsong Wu⁵

Abstract

Matching video clips of people across non-overlapping surveillance cameras (video-based person re-identification) is important in many real-world applications. In this paper, we address video-based person re-identification by developing a Local and Global Aligned Spatiotemporal Attention (LGASA) network. The LGASA network consists of five cascaded modules: 3D convolutional layers, a residual block, a spatial transformer network (STN), a multi-stream recurrent network, and a multiple-attention module. Specifically, the 3D convolutional layers capture the local, short-term, fast-varying motion information encoded in multiple adjacent original frames. The residual block extracts mid-level feature maps, and the STN aligns them. The multi-stream recurrent network exploits the local and global long-term temporal dependencies in the aligned mid-level feature maps. The multiple-attention module aggregates the feature vectors of the same body part (or the global feature vectors) from different frames within each video into a single vector according to their importance. Experimental results on three video pedestrian datasets verify the effectiveness of the proposed network.
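
The aggregation step described at the end of the abstract can be illustrated with a small, self-contained sketch. It assumes that each frame already has a feature vector for a given body part and an unnormalized importance score; how the paper's multiple-attention module computes those scores is not stated in the abstract, so the scores here are simply given as input. The function names `softmax` and `attention_pool` and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_features, frame_scores):
    """Aggregate per-frame feature vectors into a single vector.

    frame_features: list of T feature vectors (each a list of D floats)
    frame_scores:   list of T unnormalized importance scores (assumed given)
    """
    if len(frame_features) != len(frame_scores):
        raise ValueError("need exactly one score per frame")
    weights = softmax(frame_scores)  # weights are positive and sum to 1
    dim = len(frame_features[0])
    pooled = [0.0] * dim
    for weight, feature in zip(weights, frame_features):
        for d in range(dim):
            pooled[d] += weight * feature[d]
    return pooled


# Three frames of a 2-D feature for one body part (made-up numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores reduce attention pooling to a plain average over frames.
uniform = attention_pool(features, [0.0, 0.0, 0.0])

# A much larger score on the second frame makes it dominate the result.
peaked = attention_pool(features, [0.0, 50.0, 0.0])
```

With equal scores the result is the per-dimension mean of the frames; as one score grows, the pooled vector approaches that frame's feature vector, which is the sense in which frames are combined "according to their importance".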

Notes

ReduceLROnPlateau is a learning-rate scheduler provided by PyTorch; see https://pytorch.org/docs/stable/optim.html
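
The note names the scheduler but does not say how it was configured. As a rough illustration of the general reduce-on-plateau idea, the following pure-Python sketch lowers a learning rate once a monitored loss has stopped improving for more than `patience` consecutive steps. It is a simplified stand-in, not PyTorch's implementation: the class name `PlateauScheduler` and all parameter values are assumptions, and PyTorch's scheduler offers further options (such as a threshold and a cooldown) that are omitted here.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule (illustrative, not PyTorch's code)."""

    def __init__(self, lr, factor=0.1, patience=2, min_lr=0.0):
        self.lr = lr                # current learning rate
        self.factor = factor        # multiplicative decay applied on a plateau
        self.patience = patience    # non-improving steps tolerated before decay
        self.min_lr = min_lr        # lower bound on the learning rate
        self.best = float("inf")    # best (lowest) loss seen so far
        self.bad_steps = 0          # consecutive steps without improvement

    def step(self, val_loss):
        """Record one validation loss and return the (possibly reduced) rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        if self.bad_steps > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_steps = 0
        return self.lr


# Made-up validation losses: the loss stalls at 0.9 for several steps.
scheduler = PlateauScheduler(lr=1.0, factor=0.5, patience=2)
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]
history = [scheduler.step(loss) for loss in losses]
```

In this run the rate is halved on the fifth step, the third consecutive step without improvement, and stays at the reduced value afterwards.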

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions. This work was supported by the NSFC-Key Project under Grant No. 61933013, the NSFC-Key Project of General Technology Fundamental Research United Fund under Grant No. U1736211, the Key Project of Natural Science Foundation of Hubei Province under Grant No. 2018CFA024, the Natural Science Foundation of Guangdong Province under Grant No. 2019A1515011076, the National Key Research and Development Program of China under Grant No. 2017YFB0202001, the National Natural Science Foundation of China under Grant No. 61672208, the Higher Education Institution Key Research Projects of Henan Province under Grant No. 19A520001, and the Key Scientific and Technological Project of Henan Province under Grant No. 192102210277.

Author information

Authors and Affiliations

School of Computer Science, Wuhan University, Wuhan, China
Li Cheng, Xiao-Yuan Jing & Xiaoke Zhu
School of Computer, Guangdong University of Petrochemical Technology, Maoming, China
Xiao-Yuan Jing
College of Automation, Nanjing University of Posts and Telecommunications, Nanjing, China
Xiao-Yuan Jing, Chang-Hui Hu & Guangwei Gao
School of Computer and Information Engineering, Henan University, Kaifeng, China
Xiaoke Zhu
Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China
Songsong Wu

Corresponding author

Correspondence to Li Cheng.

Ethics declarations

Conflict of interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cheng, L., Jing, XY., Zhu, X. et al. Local and global aligned spatiotemporal attention network for video-based person re-identification. Multimed Tools Appl 79, 34489–34512 (2020). https://doi.org/10.1007/s11042-020-08765-1

Received: 02 May 2019
Revised: 21 December 2019
Accepted: 17 February 2020
Published: 06 March 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11042-020-08765-1
