Abstract
Action recognition in video is an important application of computer vision. In recent years, the two-stream architecture has made significant progress in action recognition, but it has not systematically explored spatial and temporal features. This paper therefore proposes an integrated approach that combines a Gaussian mixture model (GMM) with a dilated-convolution residual network (GD-RN) for action recognition, using ResNet-101 as both the spatial-stream and temporal-stream ConvNet. On the one hand, each action video is first passed through the GMM for background subtraction, and the resulting video, with the action silhouette marked, is fed to ResNet-101 for recognition and classification. Compared with the baseline ConvNet, which takes raw RGB frames as input, this both reduces the complexity of the video background and reduces the computation needed to learn spatial information. On the other hand, stacked optical-flow images are used as input to a ResNet-101 augmented with dilated convolutions, which enlarge the receptive field without lowering the resolution of the optical-flow images and thereby improve classification accuracy. The two ConvNets of the GD-RN network learn spatial and temporal features independently; the spatio-temporal features are then fine-tuned and fused to obtain the final action recognition result. The proposed method is evaluated on the challenging UCF101 and HMDB51 datasets, achieving accuracies of 91.3% and 62.4%, respectively, which demonstrates that it is competitive.
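The GMM background-subtraction step can be illustrated with a minimal sketch. This is not the paper's implementation: for clarity it models each pixel with a single running Gaussian rather than a full mixture, and the class name, learning rate `alpha`, and threshold `k` are illustrative choices. Pixels that deviate from the background Gaussian by more than `k` standard deviations are flagged as foreground (the action silhouette); background statistics are updated only where the pixel matched.

```python
import numpy as np

class GaussianBackgroundModel:
    """Per-pixel running-Gaussian background model (a simplified,
    single-component stand-in for the full GMM used in the paper)."""

    def __init__(self, shape, alpha=0.05, k=2.5):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.full(shape, 15.0 ** 2, dtype=np.float64)  # initial variance
        self.alpha = alpha   # learning rate for the running statistics
        self.k = k           # foreground threshold, in standard deviations
        self._initialized = False

    def apply(self, frame):
        """Return a boolean foreground mask and update the background model."""
        frame = frame.astype(np.float64)
        if not self._initialized:
            self.mean = frame.copy()
            self._initialized = True
            return np.zeros(frame.shape, dtype=bool)
        # Pixels far from the background Gaussian are foreground.
        fg = np.abs(frame - self.mean) > self.k * np.sqrt(self.var)
        # Update mean/variance only where the pixel matched the background.
        bg = ~fg
        d = frame - self.mean
        self.mean[bg] += self.alpha * d[bg]
        self.var[bg] += self.alpha * (d[bg] ** 2 - self.var[bg])
        return fg
```

Feeding the model a static scene for a few frames and then a frame with a bright moving region yields a mask covering only that region, which is the kind of silhouette-marked input the spatial stream receives.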
Acknowledgements
This work is supported in part by the project of the Jilin Provincial Science and Technology Department under Grant 20180201003GX and the project of the Jilin Province Development and Reform Commission under Grant 2019C053-4. The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
Funding
This research was funded by the project of the Jilin Provincial Science and Technology Department under Grant 20180201003GX and the project of the Jilin Province Development and Reform Commission under Grant 2019C053-4.
Author information
Contributions
This study was completed by the co-authors. SL conceived and led the research. The major experiments and analyses were undertaken by MF and JZ. XYB was responsible for data processing and wrote related work. FY wrote the draft and C-CH edited and reviewed the paper. All authors have read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
Additional information
Communicated by H. Lin.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Fang, M., Bai, X., Zhao, J. et al. Integrating Gaussian mixture model and dilated residual network for action recognition in videos. Multimedia Systems 26, 715–725 (2020). https://doi.org/10.1007/s00530-020-00683-4