Abstract
Human tracking and localization play a crucial role in many applications, such as accident avoidance, action recognition, safety and security, surveillance, and crowd analysis. Motivated by this broad scope, we introduce a novel method for tracking one or more humans and re-localizing them in complex environments with large displacements. The model handles complex and varying backgrounds, illumination changes, changes in target pose, the presence of visually similar targets (in pose and clothing), motion of both the target and the camera, occlusion of the target, and massive displacement of the target. Our model cascades the learning of three convolutional neural network based architectures so that each stage builds on the previous one and the overall efficiency of the model improves. The first network learns a pixel-level representation of small regions. The second network uses these features to estimate the displacement of a region and to classify it as moved, not-moved, or occluded. The third network refines the displacement estimate of the second network by exploiting the representations learned by the previous two. We also create a semi-synthetic dataset for training. The model is trained only on this dataset and then tested, without any training on real data, on subsets of the CamNeT, VOT2015, LITIV tracking, and Visual Tracker Benchmark databases. The proposed model yields results comparable to current state-of-the-art methods under the evaluation criteria described in the Object Tracking Benchmark (TPAMI 2015), CVPR 2013, and ICCV 2017.
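The three-stage cascade summarized above can be illustrated with a minimal PyTorch-style sketch. The module names, layer sizes, region size, and the residual refinement step below are illustrative assumptions only, not the authors' actual architecture; the sketch shows one plausible way the region encoder, the displacement/state classifier, and the refinement network could be wired together.

```python
# Illustrative sketch only: a possible wiring of the three-network cascade
# described in the abstract. Layer sizes, module names, and the region size
# are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    """Network 1: learns a pixel-level representation of a small region."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, region):                  # region: (B, 3, H, W)
        return self.conv(region).flatten(1)     # (B, feat_dim)

class DisplacementNet(nn.Module):
    """Network 2: predicts a coarse displacement and a 3-way state
    (moved / not-moved / occluded) from paired region features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU())
        self.disp_head = nn.Linear(128, 2)       # (dx, dy)
        self.cls_head = nn.Linear(128, 3)        # moved / not-moved / occluded

    def forward(self, f_prev, f_curr):
        h = self.trunk(torch.cat([f_prev, f_curr], dim=1))
        return self.disp_head(h), self.cls_head(h)

class RefinementNet(nn.Module):
    """Network 3: refines the coarse displacement using the features
    and outputs of the first two networks."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + 2 + 3, 128), nn.ReLU(),
            nn.Linear(128, 2),                   # refined (dx, dy)
        )

    def forward(self, f_prev, f_curr, coarse_disp, state_logits):
        x = torch.cat([f_prev, f_curr, coarse_disp, state_logits], dim=1)
        return coarse_disp + self.mlp(x)         # residual correction

# Cascaded forward pass over a pair of frames (one region per target)
encoder, disp_net, refine_net = RegionEncoder(), DisplacementNet(), RefinementNet()
prev_region = torch.randn(1, 3, 32, 32)
curr_region = torch.randn(1, 3, 32, 32)
f_prev, f_curr = encoder(prev_region), encoder(curr_region)
coarse, state = disp_net(f_prev, f_curr)
refined = refine_net(f_prev, f_curr, coarse, state)
```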
References
Babenko B, Yang MH, Belongie S (2011) Robust object tracking with online multiple instance learning. IEEE Trans Pattern Anal Mach Intell 33(8):1619–1632
Bouachir W, Bilodeau GA (2015) Collaborative part-based tracking using salient local predictors. Comput Vis Image Underst 137:88–101
Chen K, Tao W (2018) Learning linear regression via single convolutional layer for visual object tracking. IEEE Trans Multimed
Comaniciu D, Ramesh V, Meer P (2003) Kernel-based object tracking. IEEE Trans Pattern Anal Mach Intell 25(5):564–577
Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y (2006) Online passive-aggressive algorithms. J Mach Learn Res 7(Mar):551–585
Danelljan M, Bhat G, Khan FS, Felsberg M (2017) ECO: efficient convolution operators for tracking. In: CVPR, vol 1, p 3
Danelljan M, Hager G, Shahbaz Khan F, Felsberg M (2015) Convolutional features for correlation filter based visual tracking. In: Proceedings of the IEEE international conference on computer vision workshops, pp 58–66
Danelljan M, Hager G, Shahbaz Khan F, Felsberg M (2015) Learning spatially regularized correlation filters for visual tracking. In: Proceedings of the IEEE international conference on computer vision, pp 4310–4318
Dauphin Y, de Vries H, Bengio Y (2015) Equilibrated adaptive learning rates for non-convex optimization. In: Advances in neural information processing systems, pp 1504–1512
Fan H, Ling H (2017) SANet: structure-aware network for visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 42–49
Fan J, Xu W, Wu Y, Gong Y (2010) Human tracking using convolutional neural networks. IEEE Trans Neural Netw 21(10):1610–1623
Fang K, Xiang Y, Savarese S (2017) Recurrent autoregressive networks for online multi-object tracking. arXiv:1711.02741
Gan W, Wang S, Lei X, Lee MS, Kuo CCJ (2018) Online CNN-based multiple object tracking with enhanced model updates and identity association. Signal Process Image Commun 66:95–102
Henriques JF, Caseiro R, Martins P, Batista J (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In: European conference on computer vision. Springer, pp 702–715
Henriques JF, Caseiro R, Martins P, Batista J (2015) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596
Jia X, Lu H, Yang MH (2012) Visual tracking via adaptive structural local sparse appearance model. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 1822–1829
Kristan M, Matas J, Leonardis A, Felsberg M, Cehovin L, Fernandez G, Vojir T, Hager G, Nebehay G, Pflugfelder R (2015) The visual object tracking VOT2015 challenge results. In: Proceedings of the IEEE international conference on computer vision workshops, pp 1–23
Laaroussi K, Saaidi A, Masrar M, Satori K (2016) Human tracking based on appearance model. In: Proceedings of the mediterranean conference on information & communication technologies 2015. Springer, pp 297–305
Laaroussi K, Saaidi A, Masrar M, Satori K (2018) Human tracking using joint color-texture features and foreground-weighted histogram. Multimed Tools Appl 77(11):13,947–13,981
Le Cun Y, Jackel L, Boser B, Denker J, Graf H, Guyon I, Henderson D, Howard R, Hubbard W (1989) Handwritten digit recognition: applications of neural network chips and automatic learning. IEEE Commun Mag 27(11):41–46
Lu X, Tang F, Huo H, Fang T (2018) Learning channel-aware deep regression for object tracking. Pattern Recogn Lett
Ma C, Huang JB, Yang X, Yang MH (2018) Adaptive correlation filters with long-term and short-term memory for object tracking. Int J Comput Vis 8:1–26
Marszałek M, Laptev I, Schmid C (2009) Actions in context. In: IEEE conference on computer vision & pattern recognition
Nam H, Han B (2016) Learning multi-domain convolutional neural networks for visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4293–4302
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
Senior A, Hampapur A, Tian YL, Brown L, Pankanti S, Bolle R (2006) Appearance models for occlusion handling. Image Vis Comput 24(11):1233–1243
Shen Y, Lin W, Yan J, Xu M, Wu J, Wang J (2015) Person re-identification with correspondence structure learning. In: Proceedings of the IEEE international conference on computer vision, pp 3200–3208
Sidenbladh H, Black MJ, Fleet DJ (2000) Stochastic tracking of 3d human figures using 2d image motion. In: European conference on computer vision. Springer, pp 702–718
Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
Takada H, Hotta K, Janney P (2016) Human tracking in crowded scenes using target information at previous frames. In: 2016 23rd international conference on pattern recognition (ICPR). IEEE, pp 1809–1814
The LITIV datasets (2017) http://www.polymtl.ca/litiv/en/vid. Accessed 10 Aug 2018
The visual tracker benchmark database (2017) http://www.visual-tracking.net. Accessed 10 Aug 2018
Wang D, Lu H, Bo C (2015) Visual tracking via weighted local cosine similarity. IEEE Trans Cybern 45(9):1838–1850
Wang D, Sun W, Yu S, Li L, Liu W (2016) A novel background-weighted histogram scheme based on foreground saliency for mean-shift tracking. Multimed Tools Appl 75(17):10,271–10,289
Wang L, Ouyang W, Wang X, Lu H (2015) Visual tracking with fully convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 3119–3127
Wren CR, Azarbayejani A, Darrell T, Pentland AP (1997) Pfinder: Real-time tracking of the human body. IEEE Trans Pattern Anal Mach Intell 19(7):780–785
Wu Y, Lim J, Yang MH (2015) Object tracking benchmark. IEEE Trans Pattern Anal Mach Intell 37(9):1834–1848
Xiao H, Lin W, Sheng B, Lu K, Yan J, Wang J, Ding E, Zhang Y, Xiong H (2018) Group re-identification: leveraging and integrating multi-grain information. In: 2018 ACM multimedia conference on multimedia conference. ACM, pp 192–200
Zhang K, Zhang L, Liu Q, Zhang D, Yang MH (2014) Fast visual tracking via dense spatio-temporal context learning. In: European conference on computer vision. Springer, pp 127–141
Zhong W, Lu H, Yang MH (2014) Robust object tracking via sparse collaborative appearance model. IEEE Trans Image Process 23(5):2356–2368
Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A (2016) Semantic understanding of scenes through the ADE20K dataset. arXiv:1608.05442
Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A (2017) Scene parsing through ADE20K dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Cite this article
Kumar, N., Sukavanam, N. A cascaded CNN model for multiple human tracking and re-localization in complex video sequences with large displacement. Multimed Tools Appl 79, 6109–6134 (2020). https://doi.org/10.1007/s11042-019-08501-4