Abstract
To improve action recognition accuracy, we propose a discriminative kinematic descriptor (DKD) and a deep attention-pooled descriptor (DAD). First, the optical flow field is transformed into a set of more discriminative kinematic fields, from which two kinematic features are constructed that more accurately depict the dynamic characteristics of the action subject via multi-order divergence and curl fields. Second, by introducing a tight-loose constraint and an anti-confusion constraint, a discriminative fusion method is proposed that guarantees better within-class compactness and between-class separability while reducing the confusion caused by outliers; on this basis, the discriminative kinematic descriptor is constructed. Third, a prediction-attentional pooling method is proposed that accurately focuses attention on discriminative local regions, yielding the deep attention-pooled descriptor. Finally, a novel framework (DKD–DAD) combining the two descriptors is presented, which comprehensively captures the discriminative dynamic and static information in a video and consequently improves accuracy. Experiments on two challenging datasets verify the effectiveness of our methods.
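As a minimal sketch of the first step described above, the first-order divergence and curl fields can be derived from a dense optical flow field with finite differences. The function name and the use of central differences via `np.gradient` are illustrative assumptions, not the authors' implementation; higher-order fields would follow by applying the same operators recursively.

```python
import numpy as np

def kinematic_fields(flow):
    """Compute first-order divergence and curl maps from a dense
    optical-flow field of shape (H, W, 2), where flow[..., 0] is the
    horizontal component u and flow[..., 1] the vertical component v.
    """
    u, v = flow[..., 0], flow[..., 1]
    # Central-difference partial derivatives; axis 0 is y, axis 1 is x.
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    div = du_dx + dv_dy   # divergence: local expansion/contraction of motion
    curl = dv_dx - du_dy  # curl: local rotation of motion
    return div, curl
```

For a purely radial flow (u = x, v = y) this yields a constant divergence of 2 and zero curl, matching the continuous-case result.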
Acknowledgements
This work was supported in part by the Shaanxi Province Key Research and Development Plan Project S2018-YF-ZDGY-0187 and the Shaanxi Province International Cooperation Project S2018-YF-GHMS-0061.
Ethics declarations
Conflict of interest
The authors declare that they have no potential conflicts of interest.
Human and animal rights
The authors declare that this work does not involve research with human participants or animals.
Informed consent
The authors declare that this work contains no material requiring informed consent.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Tong, M., Li, M., Bai, H. et al. DKD–DAD: a novel framework with discriminative kinematic descriptor and deep attention-pooled descriptor for action recognition. Neural Comput & Applic 32, 5285–5302 (2020). https://doi.org/10.1007/s00521-019-04030-1