
MV2Flow: Learning Motion Representation for Fast Compressed Video Action Recognition

Published: 31 December 2020

Abstract

In video action recognition, motion is a crucial cue and is usually represented by optical flow. However, optical flow is computationally expensive to obtain, which makes it the efficiency bottleneck of traditional action recognition algorithms. In this article, we propose a network called MV2Flow that learns motion representation efficiently from signals in the compressed domain. The network is trained with three losses. First, we use the classical TV-L1 flow as proxy ground truth to guide the learning. Second, an unsupervised image reconstruction loss further refines the predicted flow. Third, toward the task of action recognition, the two losses above are combined with a motion content loss. To evaluate our approach, we conduct extensive experiments on two benchmark datasets, UCF-101 and HMDB-51. The motion representation generated by MV2Flow achieves action recognition accuracy comparable to TV-L1 flow while being computed over 200× faster. Built on MV2Flow, our 2D-CNN-based network achieves state-of-the-art performance in the compressed domain, and our 3D-CNN-based network reaches accuracy comparable to decoded-domain methods at higher inference speed.
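
To make the training objective concrete, the sketch below shows one plausible way to combine the three losses in PyTorch. The backward-warping formulation, the loss weights lam_recon and lam_content, and all function names are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal PyTorch sketch of the three-loss objective described in the
# abstract. All names and weights here are illustrative assumptions.
import torch
import torch.nn.functional as F


def warp(image, flow):
    """Backward-warp `image` (B, C, H, W) with `flow` (B, 2, H, W)."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    # Displace the base pixel grid by the flow, then normalize to [-1, 1]
    # as required by grid_sample.
    grid_x = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / (w - 1) - 1.0
    grid_y = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2), (x, y) order
    return F.grid_sample(image, grid, align_corners=True)


def mv2flow_loss(pred_flow, tvl1_flow, frame_t, frame_t1,
                 logits, labels, lam_recon=1.0, lam_content=0.1):
    # (1) Proxy supervision: regress toward the classical TV-L1 flow.
    loss_flow = F.l1_loss(pred_flow, tvl1_flow)
    # (2) Unsupervised refinement: warping frame t+1 back with the predicted
    #     flow should reconstruct frame t (brightness constancy).
    loss_recon = F.l1_loss(warp(frame_t1, pred_flow), frame_t)
    # (3) Motion content loss: the generated representation must stay
    #     discriminative for the downstream action classifier.
    loss_content = F.cross_entropy(logits, labels)
    return loss_flow + lam_recon * loss_recon + lam_content * loss_content
```

In a full pipeline, pred_flow would come from MV2Flow applied to the compressed-domain signals, and logits from the downstream action classifier fed with the generated motion representation.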

    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3s
    Special Issue on Privacy and Security in Evolving Internet of Multimedia Things and Regular Papers
    October 2020
    190 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3444536
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 31 December 2020
    Accepted: 01 July 2020
    Revised: 01 March 2020
    Received: 01 October 2019
    Published in TOMM Volume 16, Issue 3s


    Author Tags

    1. MV2Flow
    2. action recognition
    3. compressed domain
    4. motion representation

