Telling Stories for Common Sense Zero-Shot Action Recognition

  • Conference paper
  • Computer Vision – ACCV 2024 (ACCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15474)


Abstract

Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classifiers. Without any target-dataset fine-tuning, our method achieves a new state of the art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled data that has long impeded advancements in this exciting domain. The data can be found here: https://github.com/kini5gowda/Stories.
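To make the recipe in the abstract concrete, the sketch below shows one way per-class story texts could serve as a zero-shot semantic space: encode each class's narrative with a sentence encoder and assign an unseen video to the nearest class embedding. This is a minimal illustration under stated assumptions, not the authors' method (the paper additionally trains a feature generator on top of such embeddings); the encoder choice (`all-MiniLM-L6-v2`), the toy story snippets, and the assumption that video features already live in the text-embedding space are all hypothetical.

```python
# Hypothetical sketch only: nearest-class-story zero-shot classification.
# The paper's full method instead uses Stories to improve feature
# generation; this illustrates just the shared semantic-space idea.
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy stand-ins; real Stories entries are multi-sentence WikiHow narratives
# covering steps, scenes, objects, and verbs for each action class.
stories = {
    "archery": "Nock the arrow on the bowstring. Draw the bow and aim at the target.",
    "kayaking": "Sit in the kayak, grip the paddle, and stroke on alternating sides.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
class_names = list(stories.keys())
class_emb = encoder.encode([stories[c] for c in class_names])  # shape (C, d)
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)

def classify(video_feature: np.ndarray) -> str:
    """Label an unseen video by its most similar class story.

    Assumes `video_feature` (d-dim) has already been projected into the
    text-embedding space; learning that mapping is the hard part.
    """
    v = video_feature / np.linalg.norm(video_feature)
    scores = class_emb @ v  # cosine similarity to every class narrative
    return class_names[int(np.argmax(scores))]
```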


Notes

  1. https://www.wikihow.com/.

  2. https://www.prolific.co/.

  3. In the case of ER (Elaborative Rehearsal), this is done for some of the classes.


Author information

Corresponding author

Correspondence to Shreyank N. Gowda.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 326 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Gowda, S.N., Sevilla-Lara, L. (2025). Telling Stories for Common Sense Zero-Shot Action Recognition. In: Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H. (eds) Computer Vision – ACCV 2024. ACCV 2024. Lecture Notes in Computer Science, vol 15474. Springer, Singapore. https://doi.org/10.1007/978-981-96-0908-6_26

  • DOI: https://doi.org/10.1007/978-981-96-0908-6_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0907-9

  • Online ISBN: 978-981-96-0908-6

  • eBook Packages: Computer Science, Computer Science (R0)
