Abstract
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet, a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, use this reference to guide spatial localization, and derive the atomic-action predictions for the referred person. The key ingredients are (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at RAVAR.
K. Peng and J. Fu—Equal contribution.
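For readers unfamiliar with agent attention, the cross-stream fusion idea described in the abstract can be illustrated with a minimal PyTorch sketch: a small set of learnable agent tokens first summarizes one stream (e.g., the text or location-semantic stream), and the tokens of another stream (e.g., the video stream) then attend to those agent summaries. This is not the authors' implementation; the module name CrossStreamAgentAttention, the single-head formulation, and all dimensions below are assumptions made for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossStreamAgentAttention(nn.Module):
    """Illustrative single-head agent-attention fusion between two token streams."""

    def __init__(self, dim: int = 256, num_agents: int = 8):
        super().__init__()
        self.agents = nn.Parameter(torch.randn(num_agents, dim) * 0.02)  # learnable agent tokens
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query_stream: torch.Tensor, context_stream: torch.Tensor) -> torch.Tensor:
        # query_stream:   (B, Nq, D), e.g. video tokens
        # context_stream: (B, Nc, D), e.g. text or location-semantic tokens
        B = query_stream.size(0)
        agents = self.agents.unsqueeze(0).expand(B, -1, -1)                     # (B, A, D)

        # Stage 1: agent tokens summarize the context stream.
        attn_ctx = F.softmax(
            agents @ self.to_k(context_stream).transpose(1, 2) * self.scale, dim=-1
        )                                                                        # (B, A, Nc)
        agent_summary = attn_ctx @ self.to_v(context_stream)                     # (B, A, D)

        # Stage 2: query-stream tokens attend to the agent summaries.
        attn_q = F.softmax(
            self.to_q(query_stream) @ agent_summary.transpose(1, 2) * self.scale, dim=-1
        )                                                                        # (B, Nq, A)
        return query_stream + attn_q @ agent_summary                             # residual fusion


# Example: fuse 196 video tokens with 32 text tokens in a 256-d embedding space.
fusion = CrossStreamAgentAttention(dim=256, num_agents=8)
fused = fusion(torch.randn(2, 196, 256), torch.randn(2, 32, 256))
print(fused.shape)  # torch.Size([2, 196, 256])

Because each query token only attends to a handful of agent summaries rather than to every context token, the fusion cost grows linearly with the number of context tokens, which is the usual motivation for agent-style attention.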
Acknowledgements
The project served to prepare the SFB 1574 Circular Factory for the Perpetual Product (project ID: 471687386), funded by the German Research Foundation (DFG) with a start date of April 1, 2024. This work was also supported in part by the SmartAge project sponsored by the Carl Zeiss Stiftung (P2019-01-003; 2021–2026). This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research. The authors also acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG. This project is also supported by the National Key R&D Program under Grant 2022YFB4701400. Lastly, the authors thank Dr. Sepideh Pashami, the Swedish Innovation Agency VINNOVA, and Digital Futures for their support.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Peng, K. et al. (2025). Referring Atomic Video Action Recognition. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15077. Springer, Cham. https://doi.org/10.1007/978-3-031-72655-2_10
DOI: https://doi.org/10.1007/978-3-031-72655-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72654-5
Online ISBN: 978-3-031-72655-2
eBook Packages: Computer Science, Computer Science (R0)