
MV2Flow: Learning Motion Representation for Fast Compressed Video Action Recognition

Published: 31 December 2020

Abstract

In video action recognition, motion is a crucial cue and is usually represented by optical flow. However, optical flow is computationally expensive to obtain, which makes it the efficiency bottleneck of traditional action recognition algorithms. In this article, we propose a network called MV2Flow that learns motion representation efficiently from signals in the compressed domain. The network is trained with three losses. First, we use the classical TV-L1 flow as proxy ground truth to guide the learning. Second, an unsupervised image reconstruction loss further refines the predicted flow. Third, toward the task of action recognition, the two losses above are combined with a motion content loss. To evaluate our approach, we conduct extensive experiments on two benchmark datasets, UCF-101 and HMDB-51. The motion representation generated by MV2Flow achieves action recognition accuracy comparable to TV-L1 flow while being computed over 200× faster. Built on MV2Flow, our 2D-CNN-based network achieves state-of-the-art performance in the compressed domain, and our 3D-CNN-based network reaches accuracy comparable to decoded-domain methods at higher inference speed.
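
To make the training objective concrete, the sketch below shows one plausible way to combine the three losses in PyTorch. The backward-warping formulation, the loss weights lam_recon and lam_content, and all function names are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal PyTorch sketch of the three-loss objective described in the
# abstract. All names and weights here are illustrative assumptions.
import torch
import torch.nn.functional as F


def warp(image, flow):
    """Backward-warp `image` (B, C, H, W) with `flow` (B, 2, H, W)."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    # Displace the base pixel grid by the flow, then normalize to [-1, 1]
    # as required by grid_sample.
    grid_x = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / (w - 1) - 1.0
    grid_y = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2), (x, y) order
    return F.grid_sample(image, grid, align_corners=True)


def mv2flow_loss(pred_flow, tvl1_flow, frame_t, frame_t1,
                 logits, labels, lam_recon=1.0, lam_content=0.1):
    # (1) Proxy supervision: regress toward the classical TV-L1 flow.
    loss_flow = F.l1_loss(pred_flow, tvl1_flow)
    # (2) Unsupervised refinement: warping frame t+1 back with the predicted
    #     flow should reconstruct frame t (brightness constancy).
    loss_recon = F.l1_loss(warp(frame_t1, pred_flow), frame_t)
    # (3) Motion content loss: the generated representation must stay
    #     discriminative for the downstream action classifier.
    loss_content = F.cross_entropy(logits, labels)
    return loss_flow + lam_recon * loss_recon + lam_content * loss_content
```

In a full pipeline, pred_flow would come from MV2Flow applied to the compressed-domain signals, and logits from the downstream action classifier fed with the generated motion representation.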

    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3s
    Special Issue on Privacy and Security in Evolving Internet of Multimedia Things and Regular Papers
    October 2020
    190 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3444536
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 December 2020
    Accepted: 01 July 2020
    Revised: 01 March 2020
    Received: 01 October 2019
    Published in TOMM Volume 16, Issue 3s


    Author Tags

    1. MV2Flow
    2. action recognition
    3. compressed domain
    4. motion representation

