
DKD–DAD: a novel framework with discriminative kinematic descriptor and deep attention-pooled descriptor for action recognition

  • Original Article
  • Published in Neural Computing and Applications

Abstract

To improve action recognition accuracy, a discriminative kinematic descriptor (DKD) and a deep attention-pooled descriptor (DAD) are proposed. First, the optical flow field is transformed into a set of more discriminative kinematic fields, from which two kinematic features are constructed that depict the dynamic characteristics of the action subject more accurately through multi-order divergence and curl fields. Second, by introducing a tight-loose constraint and an anti-confusion constraint, a discriminative fusion method is proposed that guarantees better within-class compactness and between-class separability while reducing the confusion caused by outliers; on this basis, the discriminative kinematic descriptor is constructed. Third, a prediction-attentional pooling method is proposed that accurately focuses attention on discriminative local regions, yielding the deep attention-pooled descriptor. Finally, a novel framework (DKD–DAD) combining the two descriptors is presented, which comprehensively captures the discriminative dynamic and static information in a video and thereby improves recognition accuracy. Experiments on two challenging datasets verify the effectiveness of our methods.
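For readers without full-text access, the two core operations named in the abstract admit compact illustrations: the divergence of a 2D flow field measures local expansion or contraction of motion and the curl measures local rotation, while a prediction-guided attention map can weight spatial pooling toward class-relevant regions. The Python sketch below (OpenCV and NumPy) shows one plausible reading of these steps; the function names, the Farnebäck flow parameters, and the exact "multi-order" and attention constructions are assumptions, since the paper's precise formulations are not reproduced here.

```python
import cv2
import numpy as np

def kinematic_fields(prev_gray, next_gray):
    """First-order divergence and curl of dense optical flow (a sketch).

    Higher-order fields could plausibly be obtained by re-applying the
    operators to these outputs; the paper's exact "multi-order"
    construction is not public, so this is illustrative only.
    """
    # Dense optical flow via Farneback's polynomial-expansion method
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    u, v = flow[..., 0], flow[..., 1]

    # np.gradient on a 2D array returns (d/dy, d/dx): rows, then columns
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)

    divergence = du_dx + dv_dy   # local expansion/contraction of motion
    curl = dv_dx - du_dy         # local rotation of motion
    return divergence, curl

def attention_pooled_descriptor(feature_map, class_weights):
    """Hypothetical prediction-guided attentional pooling (a sketch).

    feature_map: (H, W, C) convolutional features for one frame.
    class_weights: (C,) weights of a linear prediction layer.
    The spatial class response is softmax-normalized into an attention
    map that weights average pooling, so the pooled descriptor
    concentrates on locally discriminative regions, in the spirit of
    what the abstract describes.
    """
    scores = feature_map @ class_weights            # (H, W) class response
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                              # softmax over positions
    return (feature_map * attn[..., None]).sum(axis=(0, 1))  # (C,) descriptor
```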



Acknowledgements

This work was supported in part by the Shaanxi Province Key Research and Development Plan Project S2018-YF-ZDGY-0187 and the Shaanxi Province International Cooperation Project S2018-YF-GHMS-0061.

Author information


Corresponding author

Correspondence to Ming Tong.

Ethics declarations

Conflict of interest

The authors declare that they have no potential conflicts of interest.

Human and animal rights

The authors declare that this research did not involve human participants or animals.

Informed consent

The authors declare that this article contains no material requiring informed consent.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tong, M., Li, M., Bai, H. et al. DKD–DAD: a novel framework with discriminative kinematic descriptor and deep attention-pooled descriptor for action recognition. Neural Comput & Applic 32, 5285–5302 (2020). https://doi.org/10.1007/s00521-019-04030-1

