Abstract
Deep neural networks have demonstrated remarkable recognition results on video classification; however, these gains in accuracy come at the expense of large amounts of computation. In this paper, we introduce LiteEval, a coarse-to-fine framework for resource-efficient video recognition that dynamically allocates computation on a per-video basis and can be deployed in both online and offline settings. Operating by default on low-cost features computed from images at a coarse scale, LiteEval adaptively determines on the fly when to read in more discriminative yet computationally expensive features. This is achieved through the interaction of a coarse RNN and a fine RNN, together with a conditional gating module that learns when to use more computation conditioned on incoming frames. We conduct extensive experiments on three large-scale video benchmarks, FCVID, ActivityNet and Kinetics, and demonstrate that LiteEval offers strong recognition performance while using significantly less computation in both online and offline settings.
![Fig. 1](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig1_HTML.png)
![Fig. 2](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig2_HTML.png)
![Fig. 3](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig3_HTML.png)
![Fig. 4](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig4_HTML.png)
![Fig. 5](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11263-021-01508-1/MediaObjects/11263_2021_1508_Fig5_HTML.png)
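To make the control flow described in the abstract concrete, below is a minimal PyTorch sketch of the per-frame coarse-to-fine loop. All names here (`LiteEvalSketch`, `fine_fn`, the 0.5 threshold, LSTM cells for the two RNNs) are illustrative assumptions rather than the authors' implementation; the sketch only mirrors the high-level description: a coarse RNN always runs on cheap features, and a gating module conditioned on the incoming frame and the recurrent states decides whether to also compute expensive features and update the fine RNN.

```python
# Hypothetical sketch of the coarse-to-fine step; not the authors' code.
import torch
import torch.nn as nn

class LiteEvalSketch(nn.Module):
    def __init__(self, coarse_dim, fine_dim, hidden_dim, num_classes):
        super().__init__()
        self.c_rnn = nn.LSTMCell(coarse_dim, hidden_dim)  # coarse RNN: cheap, runs on every frame
        self.f_rnn = nn.LSTMCell(fine_dim, hidden_dim)    # fine RNN: expensive, updated only when gated on
        self.gate = nn.Linear(coarse_dim + 2 * hidden_dim, 1)  # conditional gating module
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, coarse_feats, fine_fn):
        # coarse_feats: (T, B, coarse_dim), precomputed low-cost features.
        # fine_fn: callable t -> (B, fine_dim), runs the expensive CNN on demand.
        T, B, _ = coarse_feats.shape
        hc = coarse_feats.new_zeros(B, self.c_rnn.hidden_size)
        cc = torch.zeros_like(hc)
        hf = coarse_feats.new_zeros(B, self.f_rnn.hidden_size)
        cf = torch.zeros_like(hf)
        for t in range(T):
            x = coarse_feats[t]
            hc, cc = self.c_rnn(x, (hc, cc))
            # Gate conditioned on the incoming frame's coarse feature and both hidden states.
            gate_in = torch.cat([x, hc, hf], dim=-1)
            use_fine = (torch.sigmoid(self.gate(gate_in)) > 0.5).float()  # (B, 1) hard decision
            if use_fine.sum() > 0:
                # Read in the discriminative yet expensive features only when asked.
                xf = fine_fn(t)
                hf_new, cf_new = self.f_rnn(xf, (hf, cf))
                hf = use_fine * hf_new + (1 - use_fine) * hf
                cf = use_fine * cf_new + (1 - use_fine) * cf
        return self.classifier(hf)  # predict from the fine RNN's final state
```

Note that the hard 0.5 threshold reflects inference-time behavior only; a discrete gate like this is typically trained with a differentiable relaxation (e.g., Gumbel-softmax) so the whole pipeline remains end-to-end trainable.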
Notes
1. We absorb the weights of the classifier \(\varvec{W}_p\) into \(\Theta_{\texttt{fRNN}}\).
2. For SlowFast, a "frame" denotes a snippet of 8 frames.
Additional information
Communicated by Dong Xu.
This work was supported in part by the National Natural Science Foundation of China (#62032006).
Cite this article
Wu, Z., Li, H., Zheng, Y. et al. A Coarse-to-Fine Framework for Resource Efficient Video Recognition. Int J Comput Vis 129, 2965–2977 (2021). https://doi.org/10.1007/s11263-021-01508-1