Abstract
In recent years, Convolutional Neural Networks (CNNs) have proven to be among the most successful tools for video-based action recognition. The two-stream CNN approach, the most popular method for video action recognition, relies on optical flow and is therefore unsuitable for real-time applications because of its high computational cost. In this paper, we show that replacing optical flow with Motion Vectors (MVs) accelerates the CNN architecture enough to reach processing speeds usable in real-time applications. MVs are extracted directly from the compressed video bitstream. We explore how the proposed video classification method achieves very strong results. First, we use motion vectors, taking the raw video bitstream as input, to predict action classes directly without explicitly computing optical flow. Second, we build a strong baseline two-stream ConvNet using pre-trained models and transfer learning for both the spatial and temporal streams. Our approach proves to be significantly faster than the original two-stream method while achieving high accuracy and satisfying real-time requirements.
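To make the described architecture concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a two-stream ConvNet whose temporal stream consumes stacked motion-vector fields (an x/y displacement pair per frame, as parsed from the compressed bitstream) in place of optical flow. The ResNet-18 backbone, the stack length MV_STACK, and late fusion by score averaging are assumptions made for illustration only.

# Minimal sketch, assuming a ResNet-18 backbone and score-averaging fusion;
# not the paper's actual code. The temporal stream takes 2*MV_STACK channels
# of motion-vector displacements instead of optical flow.
import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_CLASSES = 101   # e.g. UCF101
MV_STACK = 10       # number of consecutive motion-vector fields stacked as input

def make_stream(in_channels: int) -> nn.Module:
    """Build one stream; in practice ImageNet-pretrained weights would be
    loaded here for transfer learning, as the paper's setup suggests."""
    net = resnet18(num_classes=NUM_CLASSES)
    if in_channels != 3:
        # Replace the first convolution so the temporal stream accepts
        # 2*MV_STACK motion-vector channels instead of 3 RGB channels.
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                              padding=3, bias=False)
    return net

spatial_stream = make_stream(3)               # one decoded RGB frame
temporal_stream = make_stream(2 * MV_STACK)   # stacked MV x/y fields

rgb = torch.randn(1, 3, 224, 224)             # placeholder RGB input
mvs = torch.randn(1, 2 * MV_STACK, 224, 224)  # placeholder MVs from the bitstream

# Late fusion: average the class scores of the two streams.
scores = (spatial_stream(rgb).softmax(-1) + temporal_stream(mvs).softmax(-1)) / 2
print(scores.argmax(-1))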
Cite this paper
Kai, L., Wu, Y., Dai, X., Ma, M. (2020). Fast Video Classification with CNNs in Compressed Domain. In: Sun, X., Wang, J., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2020. Lecture Notes in Computer Science(), vol 12239. Springer, Cham. https://doi.org/10.1007/978-3-030-57884-8_71
DOI: https://doi.org/10.1007/978-3-030-57884-8_71
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57883-1
Online ISBN: 978-3-030-57884-8
eBook Packages: Computer Science, Computer Science (R0)