Action detection based on tracklets with the two-stream CNN

Zhang, Minwen; Gao, Chenqiang; Li, Qiang; Wang, Lan; Zhang, Jiayao

doi:10.1007/s11042-017-5116-9

Published: 24 August 2017

Volume 77, pages 3303–3316, (2018)

Multimedia Tools and Applications


Abstract

Unlike action recognition, which only needs to assign the correct label to a video clip, action detection aims to both recognize and localize actions in an unknown video. While action recognition has made good progress, action detection remains a challenging task. Inspired by the success of object detection and action recognition based on the powerful Convolutional Neural Network (CNN), this paper proposes a novel action detection method that embeds multiple object tracking into the action detection process. First, we fine-tune an off-the-shelf Faster R-CNN model to detect people in frames. Then, a simple tracking-by-detection algorithm is adopted to obtain tracklets that maintain temporal consistency. Next, we apply a temporal multi-scale sliding window strategy to each tracklet to generate action proposals. Finally, each action proposal is fed into a fully connected neural network for classification, with the features of the action proposal obtained by the two-stream CNN. Experimental results show that our method outperforms the state-of-the-art methods on the J-HMDB and UCF Sports action detection datasets.
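The temporal multi-scale sliding window step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the window lengths or the stride, so the values used here (`window_lengths`, `stride_ratio`), the function name `temporal_proposals`, and the tracklet representation as a list of (frame index, box) pairs are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed box format
Tracklet = List[Tuple[int, Box]]          # one (frame index, person box) per frame


def temporal_proposals(
    tracklet: Tracklet,
    window_lengths: Sequence[int] = (8, 16, 32),  # assumed scales, not from the paper
    stride_ratio: float = 0.5,                    # assumed overlap between windows
) -> List[Tracklet]:
    """Slide windows of several temporal lengths along one tracklet.

    Each returned proposal is a contiguous slice of the tracklet, so it
    keeps the per-frame boxes needed to crop inputs for a feature extractor.
    """
    proposals: List[Tracklet] = []
    n = len(tracklet)
    for length in window_lengths:
        if length > n:
            continue  # the tracklet is too short for this scale
        stride = max(1, int(length * stride_ratio))
        for start in range(0, n - length + 1, stride):
            proposals.append(tracklet[start:start + length])
    return proposals


if __name__ == "__main__":
    # A hypothetical 20-frame tracklet with a fixed box in every frame.
    demo = [(i, (0.0, 0.0, 10.0, 10.0)) for i in range(20)]
    for p in temporal_proposals(demo):
        print(f"frames {p[0][0]}..{p[-1][0]} (length {len(p)})")
```

In the pipeline described above, each proposal would then be passed to the two-stream CNN for feature extraction and on to the classifier; those stages are outside the scope of this sketch.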

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61571071), the Wenfeng innovation and start-up project of Chongqing University of Posts and Telecommunications (No. WF201404), and the Research Innovation Program for Postgraduate of Chongqing (No. CYS17222). The authors also thank NVIDIA Corporation for the donation of a GeForce GTX TITAN X GPU.

Author information

Authors and Affiliations

Chongqing Key Laboratory of Signal and Information Processing, Chongqing University of Posts and Telecommunications, Chongqing, China
Minwen Zhang, Chenqiang Gao, Qiang Li, Lan Wang & Jiayao Zhang

Corresponding author

Correspondence to Minwen Zhang.

About this article

Cite this article

Zhang, M., Gao, C., Li, Q. et al. Action detection based on tracklets with the two-stream CNN. Multimed Tools Appl 77, 3303–3316 (2018). https://doi.org/10.1007/s11042-017-5116-9

Received: 15 March 2017
Revised: 09 August 2017
Accepted: 14 August 2017
Published: 24 August 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s11042-017-5116-9
