Integrating Gaussian mixture model and dilated residual network for action recognition in videos

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Action recognition in videos is an important application of computer vision. In recent years, the two-stream architecture has made significant progress in action recognition, but it has not systematically explored spatial–temporal features. This paper therefore proposes an approach that integrates a Gaussian mixture model (GMM) with a dilated-convolution residual network (GD-RN) for action recognition, using ResNet-101 as both the spatial- and the temporal-stream ConvNet. On the one hand, each action video is first passed through the GMM for background subtraction, and the video with the action silhouette marked is then fed to ResNet-101 for recognition and classification. Compared with the baseline, in which the ConvNet takes raw RGB images as input, this reduces both the complexity of the video background and the computation required to learn spatial information. On the other hand, stacked optical-flow images are used as the input to a ResNet-101 augmented with dilated convolutions, which enlarges the convolutional receptive field without lowering the resolution of the optical-flow images and thereby improves classification accuracy. The two ConvNets of GD-RN learn spatial and temporal features independently; the spatio-temporal features are then fine-tuned and fused to produce the final recognition result. The proposed method is evaluated on the challenging UCF101 and HMDB51 datasets, where it achieves accuracies of 91.3% and 62.4%, respectively, demonstrating that it yields competitive results.
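
To make the two stages of the pipeline concrete, two short sketches follow. They are illustrative only, written against common open-source tools rather than the authors' implementation, and every parameter value in them is an assumption. The first stands in for the GMM background-subtraction stage, using OpenCV's MOG2 subtractor (an adaptive per-pixel Gaussian mixture) to retain only the moving actor before frames reach the spatial-stream ResNet-101.

    # Hedged sketch: GMM background subtraction as preprocessing for the
    # spatial stream. OpenCV's MOG2 (an adaptive per-pixel GMM) stands in
    # for the paper's GMM stage; history/varThreshold values are guesses.
    import cv2

    def foreground_frames(video_path, history=500, var_threshold=16.0):
        """Yield frames in which the static background is suppressed."""
        subtractor = cv2.createBackgroundSubtractorMOG2(
            history=history, varThreshold=var_threshold, detectShadows=False)
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            mask = subtractor.apply(frame)         # per-pixel GMM update
            mask = cv2.medianBlur(mask, 5)         # suppress speckle noise
            yield cv2.bitwise_and(frame, frame, mask=mask)
        cap.release()

The second sketch outlines the temporal stream and the final fusion: a ResNet-101 whose later stages trade stride for dilation, so the receptive field grows without shrinking the optical-flow feature maps, and whose first convolution is widened to accept a stack of optical-flow pairs. The 10-pair stack (20 input channels), the stages chosen for dilation, and the equal fusion weights are all assumptions, not the authors' exact configuration.

    # Hedged sketch of the temporal stream and late score fusion (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet101

    def build_temporal_stream(num_classes=101, flow_stack=10):
        # Trade stride for dilation in the last two residual stages,
        # enlarging the receptive field at unchanged feature resolution.
        net = resnet101(replace_stride_with_dilation=[False, True, True])
        # Optical flow contributes 2 channels (x, y) per frame pair.
        net.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                              stride=2, padding=3, bias=False)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
        return net

    temporal = build_temporal_stream()
    flow_clip = torch.randn(1, 20, 224, 224)       # one stacked-flow clip
    temporal_logits = temporal(flow_clip)
    spatial_logits = torch.randn(1, 101)           # stand-in spatial scores
    # Late fusion: average the softmax scores of the two streams.
    fused = 0.5 * F.softmax(spatial_logits, dim=1) + \
            0.5 * F.softmax(temporal_logits, dim=1)
    print(fused.argmax(dim=1))                     # predicted action index

Averaging softmax outputs is only one common fusion choice; a weighted average (e.g., favouring the temporal stream) is an equally plausible reading of "fuse the spatio-temporal features".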

Acknowledgements

This work was supported in part by the Jilin Provincial Science and Technology Department under Grant 20180201003GX and by the Jilin Province Development and Reform Commission under Grant 2019C053-4. The authors gratefully acknowledge the reviewers' helpful comments and suggestions, which have improved the presentation.

Funding

This research was funded by the Jilin Provincial Science and Technology Department under Grant 20180201003GX and by the Jilin Province Development and Reform Commission under Grant 2019C053-4.

Author information

Contributions

This study was completed jointly by the co-authors. SL conceived and led the research. The major experiments and analyses were undertaken by MF and JZ. XYB was responsible for data processing and wrote the related-work section. FY wrote the draft, and C-CH edited and reviewed the paper. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Shuhua Liu.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Communicated by H. Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fang, M., Bai, X., Zhao, J. et al. Integrating Gaussian mixture model and dilated residual network for action recognition in videos. Multimedia Systems 26, 715–725 (2020). https://doi.org/10.1007/s00530-020-00683-4
