
Enhanced 3D residual network for video event recognition in shipping monitoring

Published in: Multimedia Tools and Applications

Abstract

Three-dimensional convolutional neural networks are widely used in video recognition, action recognition, and related tasks because they extract temporal and spatial features directly. However, their large parameter counts, high computational cost, and difficulty of training have generally kept their architectures shallow: the traditional C3D [17] method uses only an 11-layer VGGNet structure, and the traditional Res3D [18] method adopts residual networks of 18 and 34 layers. Experience with two-dimensional convolutional neural networks suggests that deeper architectures achieve higher recognition accuracy. This paper therefore proposes a new method, 3D ResNet-66, which combines a 50-layer 3D residual network with four-layer residual blocks, increasing network depth while effectively reducing the number of parameters; through experiments we obtain a stronger video recognition model. We evaluate our method on shipping event datasets. Compared with the traditional C3D and Res3D methods, our method improves accuracy from 91.48% to 96.33%, reduces the model size from 561 MB to 135 MB, and halves the average processing time.
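The parameter savings that motivate the deeper-but-smaller design can be illustrated with simple arithmetic: a bottleneck residual block (as in ResNet-50, extended to 3D) replaces wide 3×3×3 convolutions with a 1×1×1 reduce/expand pair around a narrow 3×3×3 convolution. The channel widths below are illustrative assumptions, not the paper's exact configuration:

```python
# A 3D convolution mapping c_in -> c_out channels with a k x k x k kernel
# has c_in * c_out * k**3 weights (biases ignored for simplicity).

def conv3d_params(c_in, c_out, k):
    """Weight count of a single 3D convolution layer (no bias)."""
    return c_in * c_out * k ** 3

# Plain block: two 3x3x3 convolutions at 256 channels (C3D / basic-block style).
plain = 2 * conv3d_params(256, 256, 3)

# Bottleneck block: 1x1x1 reduce to 64, 3x3x3 at 64, 1x1x1 expand back to 256.
bottleneck = (conv3d_params(256, 64, 1)
              + conv3d_params(64, 64, 3)
              + conv3d_params(64, 256, 1))

print(plain, bottleneck)  # the bottleneck block uses far fewer weights
```

With these widths the bottleneck block has roughly 4% of the plain block's weights while contributing three layers of depth instead of two, which is the same trade-off the abstract describes: more depth, fewer parameters.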



References

  1. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308

  2. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1933–1941

  3. Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6546–6555

  4. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  5. He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In: European conference on computer vision. Springer, Cham, pp 630–645

  6. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708

  7. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167

  8. Ji S, Xu W, Yang M, Yu K (2012) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231

  9. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Suleyman M (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950

  10. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp 807–814

  11. Qiu Z, Yao T, Mei T (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE international conference on computer vision, pp 5533–5541

  12. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  13. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576

  14. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence

  15. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

  16. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826

  17. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497

  18. Tran D, Ray J, Shou Z, Chang SF, Paluri M (2017) Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038

  19. Wang W, Shen J, Lu X, Hoi SC, Ling H (2020) Paying attention to video object pattern understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence

  20. Zagoruyko S, Komodakis N (2016) Wide residual networks. arXiv preprint arXiv:1605.07146


Author information

Correspondence to Hong Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, H., Rong, J. Enhanced 3D residual network for video event recognition in shipping monitoring. Multimed Tools Appl 80, 3337–3348 (2021). https://doi.org/10.1007/s11042-020-09564-4

