
Action detection based on tracklets with the two-stream CNN

Published in: Multimedia Tools and Applications

Abstract

Unlike action recognition, which only needs to assign a correct label to a video clip, action detection aims to both recognize and localize actions in an unknown video. While action recognition has made good progress, action detection remains a challenging task. Inspired by the success of object detection and action recognition built on the powerful Convolutional Neural Network (CNN), this paper proposes a novel action detection method that embeds multiple object tracking into the detection process. First, we fine-tune an off-the-shelf Faster R-CNN model to detect people in individual frames. Then, a simple tracking-by-detection algorithm links the detections into tracklets, preserving temporal consistency. Next, a temporal multi-scale sliding-window strategy is applied to each tracklet to generate action proposals. Finally, each action proposal is fed into a fully connected neural network for classification, with proposal features extracted by the two-stream CNN. Experimental results show that our method outperforms state-of-the-art methods on the J-HMDB and UCF Sports action detection datasets.
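The proposal-generation stage described above can be sketched as follows: per-frame person boxes (e.g. from Faster R-CNN) are linked frame to frame by greedy IoU matching (tracking-by-detection), and temporal multi-scale sliding windows over each tracklet then yield action proposals. This is a minimal illustrative sketch; the matching threshold, window scales, and stride are assumptions, not the authors' exact settings.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def build_tracklets(frame_dets, iou_thr=0.5):
    """Link per-frame detections into tracklets by greedy IoU matching.

    frame_dets: list over frames; each entry is a list of person boxes.
    Returns a list of tracklets, each a list of (frame_idx, box) pairs.
    """
    tracklets = []
    for t, dets in enumerate(frame_dets):
        unmatched = list(dets)
        for tr in tracklets:
            last_t, last_box = tr[-1]
            # Only extend tracklets that were alive in the previous frame.
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_thr:
                tr.append((t, best))
                unmatched.remove(best)
        # Leftover detections start new tracklets.
        for b in unmatched:
            tracklets.append([(t, b)])
    return tracklets

def temporal_proposals(tracklet, scales=(8, 16, 32), stride=4):
    """Multi-scale temporal sliding windows over one tracklet.

    Returns (start_frame, end_frame, boxes) action proposals, one per
    window position; each proposal would then be scored by the classifier.
    """
    frames = [f for f, _ in tracklet]
    boxes = [b for _, b in tracklet]
    props = []
    for L in scales:
        for s in range(0, len(frames) - L + 1, stride):
            props.append((frames[s], frames[s + L - 1], boxes[s:s + L]))
    return props
```

For example, a single person box drifting one pixel per frame over 20 frames links into one 20-frame tracklet, and windows of length 8 and 16 (stride 4) produce six candidate proposals from it; in the full method, two-stream CNN features of each proposal would feed the classification network.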


[Figures 1–5 omitted]



Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61571071), the Wenfeng Innovation and Start-up Project of Chongqing University of Posts and Telecommunications (No. WF201404), and the Research Innovation Program for Postgraduates of Chongqing (No. CYS17222). The authors also thank NVIDIA Corporation for the donation of a GeForce GTX TITAN X GPU.

Author information

Correspondence to Minwen Zhang.


About this article


Cite this article

Zhang, M., Gao, C., Li, Q. et al. Action detection based on tracklets with the two-stream CNN. Multimed Tools Appl 77, 3303–3316 (2018). https://doi.org/10.1007/s11042-017-5116-9

