Abstract
In this paper, we address the substantial training time and memory consumption of video transformers, using the ViViT (Video Vision Transformer) model, in particular its Factorised Encoder variant, as our baseline for action recognition. The Factorised Encoder follows the late-fusion approach adopted by many state-of-the-art methods. Although it offers the most favorable speed/accuracy trade-off among the ViViT variants, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. Done naively, this yields a low-accuracy model. However, we show that by (1) appropriately initializing the temporal transformer (the module responsible for processing temporal information) and (2) introducing a compact adapter that connects the representations of the frozen spatial transformer (the module that attends to regions within each input frame) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation across six benchmarks, we demonstrate that our proposed training strategy significantly reduces training cost and memory consumption while maintaining or slightly improving performance, by up to 1.79%, compared to the baseline model. Our approach additionally makes it possible to use larger image transformers as the spatial transformer and to process more frames within the same memory budget. We also show that this approach generalizes to other factorised encoder models. The advancements made in this work have the potential to accelerate research in video understanding and to provide valuable insights for researchers and practitioners with limited resources, paving the way for more efficient and scalable alternatives in action recognition.
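The abstract describes three ingredients: a frozen spatial (per-frame) transformer, a suitably initialized temporal transformer, and a compact adapter between them. The following PyTorch sketch only illustrates that general setup under our own assumptions; the BottleneckAdapter design, the feature dimension, and the interface of the pretrained spatial model are placeholders, not the authors' implementation.

```python
# Minimal sketch of a factorised encoder with a frozen spatial transformer,
# a compact adapter, and a trainable temporal transformer. Illustrative only.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Compact residual adapter mapping frozen per-frame features to the temporal transformer."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual bottleneck


class FactorisedEncoder(nn.Module):
    def __init__(self, spatial: nn.Module, dim: int, num_classes: int,
                 temporal_layers: int = 4, heads: int = 8):
        super().__init__()
        self.spatial = spatial                      # pretrained image transformer (assumed to return (N, dim) embeddings)
        for p in self.spatial.parameters():         # freeze: no gradients, no optimizer state for the spatial tower
            p.requires_grad = False
        self.adapter = BottleneckAdapter(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # The paper stresses that initializing this temporal transformer well is important;
        # that initialization scheme is not reproduced here.
        self.temporal = nn.TransformerEncoder(layer, num_layers=temporal_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t = video.shape[:2]
        with torch.no_grad():                       # frozen spatial pass also avoids storing its activations
            frame_feats = self.spatial(video.flatten(0, 1))   # (b*t, dim) per-frame embeddings
        frame_feats = frame_feats.view(b, t, -1)
        tokens = self.adapter(frame_feats)          # only the adapter, temporal transformer, and head are trained
        tokens = self.temporal(tokens)
        return self.head(tokens.mean(dim=1))        # average over frames, then classify
```

Because only the adapter, temporal transformer, and classification head receive gradients, both the backward pass and the optimizer state shrink, which is the source of the training-time and memory savings the abstract refers to.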
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gowda, S.N., Arnab, A., Huang, J. (2025). Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15068. Springer, Cham. https://doi.org/10.1007/978-3-031-72684-2_26
DOI: https://doi.org/10.1007/978-3-031-72684-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72683-5
Online ISBN: 978-3-031-72684-2
eBook Packages: Computer Science (R0)