
Video sketch: A middle-level representation for action recognition

Published in: Applied Intelligence

Abstract

Different modalities extracted from videos, such as RGB frames and optical flow, can provide complementary cues for improving video action recognition. In this paper, we introduce a new modality named video sketch, which encodes human shape information, as a complementary modality for video action representation. We show that video action recognition can be enhanced by using the proposed video sketch. More specifically, we first generate video sketches with class-distinctive action areas, and then employ a two-stream network to combine the shape information extracted from the image-based sketch and the point-based sketch, fusing the classification scores of the two streams to produce a shape representation for videos. Finally, we use this shape representation as a complement to the traditional appearance (RGB) and motion (optical flow) representations for the final video classification. We conduct extensive experiments on five human action recognition datasets: KTH, HMDB51, UCF101, Something-Something, and UTI. The experimental results show that the proposed method outperforms existing state-of-the-art action recognition methods.
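The two-stage late fusion described above (sketch streams fused into a shape score, which is then fused with the RGB and optical-flow scores) can be sketched as follows. This is a minimal illustration under assumptions: the function name `fuse_scores` and the fusion weights are hypothetical, and the paper's actual weighting scheme may differ.

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted average of per-stream class-score vectors (late fusion)."""
    assert len(score_list) == len(weights)
    fused = sum(w * s for w, s in zip(weights, score_list))
    return fused / sum(weights)

rng = np.random.default_rng(0)
n_classes = 101  # e.g. the UCF101 label set

# Stage 1: fuse the two sketch streams (image-based and point-based)
# into a single shape score vector. Random scores stand in for
# real softmax outputs here.
img_sketch_scores = rng.random(n_classes)
point_sketch_scores = rng.random(n_classes)
shape_scores = fuse_scores([img_sketch_scores, point_sketch_scores],
                           weights=[1.0, 1.0])

# Stage 2: fuse the shape scores with appearance (RGB) and motion
# (optical flow) scores; the weights below are illustrative only.
rgb_scores = rng.random(n_classes)
flow_scores = rng.random(n_classes)
final_scores = fuse_scores([rgb_scores, flow_scores, shape_scores],
                           weights=[1.0, 1.5, 0.5])

predicted_class = int(np.argmax(final_scores))
print(predicted_class)
```

In this scheme each stream can be trained independently, and the fusion weights can be tuned on a validation set without retraining the networks.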





Acknowledgements

This work is supported by Fundamental Research Funds for the Central Universities (2018YJS045, 2019JBZ104).

Corresponding author

Correspondence to Ya-Ping Huang.



Cite this article

Zhang, XY., Huang, YP., Mi, Y. et al. Video sketch: A middle-level representation for action recognition. Appl Intell 51, 2589–2608 (2021). https://doi.org/10.1007/s10489-020-01905-y
