Abstract
Because computing optical flow is time-consuming, recent works have adopted the correlation operation as an alternative way to extract motion features. Although the correlation operation yields significant accuracy gains at negligible FLOPs, it incurs far more latency per FLOP than convolution, and its latency grows noticeably as the search patch is enlarged. Shrinking the search patch, however, inevitably degrades performance, since small patches cannot capture large displacements. In this paper, we propose an effective, low-latency Multi-Scale Motion-Aware (MSMA) module that applies small search patches at multiple feature scales to efficiently extract motion features for large displacements. The module can be plugged into, and generalizes well across, different CNN backbones. When installed into TSM ResNet-50, MSMA introduces only \(\approx \) 17.6% additional latency on an NVIDIA Tesla V100 GPU, yet it achieves state-of-the-art performance on Something-Something V1 & V2 and Diving-48.
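To make the idea concrete, below is a minimal PyTorch sketch of the general multi-scale correlation principle the abstract describes. It is not the authors' implementation: the function names `local_correlation` and `multi_scale_correlation`, and all parameter choices, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_correlation(feat_t, feat_t1, patch_size=3):
    """Correlate each position of feat_t with a patch_size x patch_size
    neighborhood in feat_t1 (a local cost volume).
    Input:  two feature maps of shape (B, C, H, W).
    Output: (B, patch_size**2, H, W). Cost grows with patch_size**2,
    which is why large search patches are expensive."""
    B, C, H, W = feat_t.shape
    pad = patch_size // 2
    # Gather every candidate neighborhood of the next frame.
    neighbors = F.unfold(feat_t1, kernel_size=patch_size, padding=pad)
    neighbors = neighbors.view(B, C, patch_size * patch_size, H, W)
    # Dot-product similarity between the reference feature and each candidate.
    return (feat_t.unsqueeze(2) * neighbors).sum(dim=1) / C

def multi_scale_correlation(feat_t, feat_t1, scales=(1, 2, 4), patch_size=3):
    """Apply a small-patch correlation at several pooled scales, so coarse
    scales cover large displacements cheaply, then upsample and concatenate.
    Assumes H and W are divisible by every scale."""
    B, C, H, W = feat_t.shape
    outs = []
    for s in scales:
        a = F.avg_pool2d(feat_t, s) if s > 1 else feat_t
        b = F.avg_pool2d(feat_t1, s) if s > 1 else feat_t1
        corr = local_correlation(a, b, patch_size)
        outs.append(F.interpolate(corr, size=(H, W), mode='bilinear',
                                  align_corners=False))
    return torch.cat(outs, dim=1)  # (B, len(scales) * patch_size**2, H, W)
```

With `scales=(1, 2, 4)` and a 3×3 patch, the coarsest level spans displacements of roughly ±4 pixels at full resolution while computing only nine similarity values per position per scale, which is the trade-off the abstract contrasts against a single large search patch.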
Acknowledgments
This research is co-sponsored by ITRI and the Ministry of Science and Technology (MoST). This work is also financially supported by the "Center for Open Intelligent Connectivity" of the "Higher Education Sprout Project" of NYCU and MOE, Taiwan.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Peng, H.-W., Tseng, Y.-C. (2023). Multi-scale Motion-Aware Module for Video Action Recognition. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds.) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol. 13806. Springer, Cham. https://doi.org/10.1007/978-3-031-25075-0_40
Print ISBN: 978-3-031-25074-3
Online ISBN: 978-3-031-25075-0
eBook Packages: Computer Science, Computer Science (R0)