Integrating Gaussian mixture model and dilated residual network for action recognition in videos

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Action recognition in videos is an important application of computer vision. In recent years, the two-stream architecture has made significant progress in action recognition, but it has not systematically explored spatial–temporal features. This paper therefore proposes an approach that integrates a Gaussian mixture model (GMM) with a dilated-convolution residual network (GD-RN) for action recognition, using ResNet-101 as both the spatial- and the temporal-stream ConvNet. On the one hand, each action video is first passed through the GMM for background subtraction, and the video with the action silhouette marked is then fed to ResNet-101 for recognition and classification. Compared with the baseline, in which the ConvNet takes raw RGB images as input, this reduces both the complexity of the video background and the computation required to learn spatial information. On the other hand, stacked optical-flow images are used as the input to a ResNet-101 augmented with dilated convolutions, which enlarges the convolutional receptive field without lowering the resolution of the optical-flow images and thereby improves classification accuracy. The two ConvNets of GD-RN learn spatial and temporal features independently; the spatio-temporal features are then fine-tuned and fused to produce the final recognition result. The proposed method is evaluated on the challenging UCF101 and HMDB51 datasets, where it achieves accuracies of 91.3% and 62.4%, respectively, demonstrating that it yields competitive results.
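
To make the two stages of the pipeline concrete, two short sketches follow. They are illustrative only, written against common open-source tools rather than the authors' implementation, and every parameter value in them is an assumption. The first stands in for the GMM background-subtraction stage, using OpenCV's MOG2 subtractor (an adaptive per-pixel Gaussian mixture) to retain only the moving actor before frames reach the spatial-stream ResNet-101.

    # Hedged sketch: GMM background subtraction as preprocessing for the
    # spatial stream. OpenCV's MOG2 (an adaptive per-pixel GMM) stands in
    # for the paper's GMM stage; history/varThreshold values are guesses.
    import cv2

    def foreground_frames(video_path, history=500, var_threshold=16.0):
        """Yield frames in which the static background is suppressed."""
        subtractor = cv2.createBackgroundSubtractorMOG2(
            history=history, varThreshold=var_threshold, detectShadows=False)
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            mask = subtractor.apply(frame)         # per-pixel GMM update
            mask = cv2.medianBlur(mask, 5)         # suppress speckle noise
            yield cv2.bitwise_and(frame, frame, mask=mask)
        cap.release()

The second sketch outlines the temporal stream and the final fusion: a ResNet-101 whose later stages trade stride for dilation, so the receptive field grows without shrinking the optical-flow feature maps, and whose first convolution is widened to accept a stack of optical-flow pairs. The 10-pair stack (20 input channels), the stages chosen for dilation, and the equal fusion weights are all assumptions, not the authors' exact configuration.

    # Hedged sketch of the temporal stream and late score fusion (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet101

    def build_temporal_stream(num_classes=101, flow_stack=10):
        # Trade stride for dilation in the last two residual stages,
        # enlarging the receptive field at unchanged feature resolution.
        net = resnet101(replace_stride_with_dilation=[False, True, True])
        # Optical flow contributes 2 channels (x, y) per frame pair.
        net.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                              stride=2, padding=3, bias=False)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
        return net

    temporal = build_temporal_stream()
    flow_clip = torch.randn(1, 20, 224, 224)       # one stacked-flow clip
    temporal_logits = temporal(flow_clip)
    spatial_logits = torch.randn(1, 101)           # stand-in spatial scores
    # Late fusion: average the softmax scores of the two streams.
    fused = 0.5 * F.softmax(spatial_logits, dim=1) + \
            0.5 * F.softmax(temporal_logits, dim=1)
    print(fused.argmax(dim=1))                     # predicted action index

Averaging softmax outputs is only one common fusion choice; a weighted average (e.g., favouring the temporal stream) is an equally plausible reading of "fuse the spatio-temporal features".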

Acknowledgements

This work was supported in part by the Jilin Provincial Science and Technology Department under Grant 20180201003GX and by the Jilin Province Development and Reform Commission under Grant 2019C053-4. The authors gratefully acknowledge the reviewers' helpful comments and suggestions, which have improved the presentation.

Funding

This research was funded by the Jilin Provincial Science and Technology Department under Grant 20180201003GX and by the Jilin Province Development and Reform Commission under Grant 2019C053-4.

Author information

Contributions

This study was completed jointly by the co-authors. SL conceived and led the research. The major experiments and analyses were undertaken by MF and JZ. XYB was responsible for data processing and wrote the related-work section. FY wrote the draft, and C-CH edited and reviewed the paper. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Shuhua Liu.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Communicated by H. Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fang, M., Bai, X., Zhao, J. et al. Integrating Gaussian mixture model and dilated residual network for action recognition in videos. Multimedia Systems 26, 715–725 (2020). https://doi.org/10.1007/s00530-020-00683-4
