Abstract
In this paper, we address the substantial training time and memory consumption of video transformers, using the ViViT (Video Vision Transformer) model, in particular its Factorised Encoder variant, as our baseline for action recognition. The Factorised Encoder follows the late-fusion approach adopted by many state-of-the-art methods. Although it offers the most favorable speed/accuracy trade-off among the ViViT variants, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. Done naively, this yields a low-accuracy model. However, we show that by (1) appropriately initializing the temporal transformer (the module responsible for processing temporal information) and (2) introducing a compact adapter that connects the representations of the frozen spatial transformer (the module that attends to regions within each input frame) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation across six benchmarks, we demonstrate that our proposed training strategy significantly reduces training cost and memory consumption while maintaining or slightly improving performance, by up to 1.79%, compared to the baseline model. Our approach additionally makes it possible to use larger image transformers as the spatial transformer and to process more frames within the same memory budget. We also show that this approach generalizes to other factorised encoder models. The advancements made in this work have the potential to accelerate research in video understanding and to provide valuable insights for researchers and practitioners with limited resources, paving the way for more efficient and scalable alternatives in action recognition.
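The abstract describes three ingredients: a frozen spatial (per-frame) transformer, a suitably initialized temporal transformer, and a compact adapter between them. The following PyTorch sketch only illustrates that general setup under our own assumptions; the BottleneckAdapter design, the feature dimension, and the interface of the pretrained spatial model are placeholders, not the authors' implementation.

```python
# Minimal sketch of a factorised encoder with a frozen spatial transformer,
# a compact adapter, and a trainable temporal transformer. Illustrative only.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Compact residual adapter mapping frozen per-frame features to the temporal transformer."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual bottleneck


class FactorisedEncoder(nn.Module):
    def __init__(self, spatial: nn.Module, dim: int, num_classes: int,
                 temporal_layers: int = 4, heads: int = 8):
        super().__init__()
        self.spatial = spatial                      # pretrained image transformer (assumed to return (N, dim) embeddings)
        for p in self.spatial.parameters():         # freeze: no gradients, no optimizer state for the spatial tower
            p.requires_grad = False
        self.adapter = BottleneckAdapter(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # The paper stresses that initializing this temporal transformer well is important;
        # that initialization scheme is not reproduced here.
        self.temporal = nn.TransformerEncoder(layer, num_layers=temporal_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t = video.shape[:2]
        with torch.no_grad():                       # frozen spatial pass also avoids storing its activations
            frame_feats = self.spatial(video.flatten(0, 1))   # (b*t, dim) per-frame embeddings
        frame_feats = frame_feats.view(b, t, -1)
        tokens = self.adapter(frame_feats)          # only the adapter, temporal transformer, and head are trained
        tokens = self.temporal(tokens)
        return self.head(tokens.mean(dim=1))        # average over frames, then classify
```

Because only the adapter, temporal transformer, and classification head receive gradients, both the backward pass and the optimizer state shrink, which is the source of the training-time and memory savings the abstract refers to.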
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gowda, S.N., Arnab, A., Huang, J. (2025). Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15068. Springer, Cham. https://doi.org/10.1007/978-3-031-72684-2_26
DOI: https://doi.org/10.1007/978-3-031-72684-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72683-5
Online ISBN: 978-3-031-72684-2
eBook Packages: Computer Science (R0)