Abstract
To improve action recognition accuracy, we propose a discriminative kinematic descriptor (DKD) and a deep attention-pooled descriptor (DAD). First, the optical flow field is transformed into a set of more discriminative kinematic fields, from which two kinematic features are constructed that more accurately depict the dynamic characteristics of the action subject via multi-order divergence and curl fields. Second, by introducing a tight-loose constraint and an anti-confusion constraint, a discriminative fusion method is proposed that guarantees better within-class compactness and between-class separability while reducing the confusion caused by outliers; on this basis, the discriminative kinematic descriptor is constructed. Third, a prediction-attentional pooling method is proposed that accurately focuses attention on discriminative local regions, yielding the deep attention-pooled descriptor. Finally, a novel framework (DKD–DAD) combining the two descriptors is presented, which comprehensively captures the discriminative dynamic and static information in a video and consequently improves accuracy. Experiments on two challenging datasets verify the effectiveness of our methods.
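As a minimal sketch of the first step described above, the first-order divergence and curl fields can be derived from a dense optical flow field with finite differences. The function name and the use of central differences via `np.gradient` are illustrative assumptions, not the authors' implementation; higher-order fields would follow by applying the same operators recursively.

```python
import numpy as np

def kinematic_fields(flow):
    """Compute first-order divergence and curl maps from a dense
    optical-flow field of shape (H, W, 2), where flow[..., 0] is the
    horizontal component u and flow[..., 1] the vertical component v.
    """
    u, v = flow[..., 0], flow[..., 1]
    # Central-difference partial derivatives; axis 0 is y, axis 1 is x.
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    div = du_dx + dv_dy   # divergence: local expansion/contraction of motion
    curl = dv_dx - du_dy  # curl: local rotation of motion
    return div, curl
```

For a purely radial flow (u = x, v = y) this yields a constant divergence of 2 and zero curl, matching the continuous-case result.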
Acknowledgements
This work was supported in part by the Shaanxi Province Key Research and Development Plan Project S2018-YF-ZDGY-0187 and the Shaanxi Province International Cooperation Project S2018-YF-GHMS-0061.
Ethics declarations
Conflict of interest
The authors declare that they have no potential conflicts of interest.
Human and animal rights
The authors declare that this work does not involve research with human participants or animals.
Informed consent
The authors declare that this work contains no material requiring informed consent.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Tong, M., Li, M., Bai, H. et al. DKD–DAD: a novel framework with discriminative kinematic descriptor and deep attention-pooled descriptor for action recognition. Neural Comput & Applic 32, 5285–5302 (2020). https://doi.org/10.1007/s00521-019-04030-1