Abstract
Deep neural networks have demonstrated remarkable recognition results on video classification; however, these gains in accuracy come at the expense of large amounts of computation. In this paper, we introduce LiteEval, a coarse-to-fine framework for resource-efficient video recognition that dynamically allocates computation on a per-video basis and can be deployed in both online and offline settings. Operating by default on low-cost features computed from images at a coarse scale, LiteEval adaptively determines on the fly when to read in more discriminative yet computationally expensive features. This is achieved through the interaction of a coarse RNN and a fine RNN, together with a conditional gating module that learns when to use more computation conditioned on incoming frames. We conduct extensive experiments on three large-scale video benchmarks, FCVID, ActivityNet and Kinetics, and demonstrate that LiteEval offers strong recognition performance while using significantly less computation in both online and offline settings.
![Fig. 1](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig1_HTML.png)
![Fig. 2](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig2_HTML.png)
![Fig. 3](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig3_HTML.png)
![Fig. 4](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig4_HTML.png)
![Fig. 5](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig5_HTML.png)
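To make the control flow described in the abstract concrete, below is a minimal PyTorch sketch of the per-frame coarse-to-fine loop. All names here (`LiteEvalSketch`, `fine_fn`, the 0.5 threshold, LSTM cells for the two RNNs) are illustrative assumptions rather than the authors' implementation; the sketch only mirrors the high-level description: a coarse RNN always runs on cheap features, and a gating module conditioned on the incoming frame and the recurrent states decides whether to also compute expensive features and update the fine RNN.

```python
# Hypothetical sketch of the coarse-to-fine step; not the authors' code.
import torch
import torch.nn as nn

class LiteEvalSketch(nn.Module):
    def __init__(self, coarse_dim, fine_dim, hidden_dim, num_classes):
        super().__init__()
        self.c_rnn = nn.LSTMCell(coarse_dim, hidden_dim)  # coarse RNN: cheap, runs on every frame
        self.f_rnn = nn.LSTMCell(fine_dim, hidden_dim)    # fine RNN: expensive, updated only when gated on
        self.gate = nn.Linear(coarse_dim + 2 * hidden_dim, 1)  # conditional gating module
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, coarse_feats, fine_fn):
        # coarse_feats: (T, B, coarse_dim), precomputed low-cost features.
        # fine_fn: callable t -> (B, fine_dim), runs the expensive CNN on demand.
        T, B, _ = coarse_feats.shape
        hc = coarse_feats.new_zeros(B, self.c_rnn.hidden_size)
        cc = torch.zeros_like(hc)
        hf = coarse_feats.new_zeros(B, self.f_rnn.hidden_size)
        cf = torch.zeros_like(hf)
        for t in range(T):
            x = coarse_feats[t]
            hc, cc = self.c_rnn(x, (hc, cc))
            # Gate conditioned on the incoming frame's coarse feature and both hidden states.
            gate_in = torch.cat([x, hc, hf], dim=-1)
            use_fine = (torch.sigmoid(self.gate(gate_in)) > 0.5).float()  # (B, 1) hard decision
            if use_fine.sum() > 0:
                # Read in the discriminative yet expensive features only when asked.
                xf = fine_fn(t)
                hf_new, cf_new = self.f_rnn(xf, (hf, cf))
                hf = use_fine * hf_new + (1 - use_fine) * hf
                cf = use_fine * cf_new + (1 - use_fine) * cf
        return self.classifier(hf)  # predict from the fine RNN's final state
```

Note that the hard 0.5 threshold reflects inference-time behavior only; a discrete gate like this is typically trained with a differentiable relaxation (e.g., Gumbel-softmax) so the whole pipeline remains end-to-end trainable.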
Notes
1. We absorb the weights of the classifier \(\varvec{W}_p\) into \(\Theta_{\texttt{fRNN}}\).
2. For SlowFast, a "frame" denotes a snippet of 8 frames.
Additional information
Communicated by Dong Xu.
This work was supported in part by the National Natural Science Foundation of China (#62032006).
Cite this article
Wu, Z., Li, H., Zheng, Y. et al. A Coarse-to-Fine Framework for Resource Efficient Video Recognition. Int J Comput Vis 129, 2965–2977 (2021). https://doi.org/10.1007/s11263-021-01508-1