ActionVOS: Actions as Prompts for Video Object Segmentation

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) is pivotal to understanding human activities. However, the existing RVOS task relies primarily on static attributes such as object names to segment target objects, which makes it difficult to distinguish target objects from background objects and to identify objects undergoing state changes. To address these problems, this work proposes a novel action-aware RVOS setting called ActionVOS, which aims to segment only active objects in egocentric videos using human actions as a key language prompt. Human actions precisely describe human behavior, thereby helping to identify the objects truly involved in the interaction and to understand possible state changes. We also build a method tailored to this specific setting. Specifically, we develop an action-aware labeling module with an efficient action-guided focal loss. These designs enable the ActionVOS model to prioritize active objects using existing, readily available annotations. Experimental results on the VISOR dataset show that ActionVOS significantly reduces the mis-segmentation of inactive objects, confirming that actions help the ActionVOS model understand objects' involvement. Further evaluations on the VOST and VSCOS datasets show that the novel ActionVOS setting enhances segmentation performance in challenging circumstances involving object state changes. We will make our implementation available at https://github.com/ut-vision/ActionVOS.
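
The abstract mentions an action-aware labeling module trained with an action-guided focal loss, but gives no formula. The sketch below is only an illustration of the general idea under stated assumptions: a standard binary focal loss whose per-pixel contribution is re-weighted by an action-relevance map, so that objects named by the action prompt (active objects) dominate the gradient. The function name, argument layout, and weighting scheme are hypothetical, not the authors' implementation.

import torch
import torch.nn.functional as F

def action_guided_focal_loss(logits, targets, action_weight,
                             gamma=2.0, alpha=0.25):
    """Illustrative sketch, not the paper's exact loss.

    logits:        (N, H, W) raw mask predictions for one object query
    targets:       (N, H, W) binary ground-truth masks (float)
    action_weight: (N, H, W) relevance map, e.g. >1 on objects referenced
                   by the action prompt (active), <=1 elsewhere
    """
    probs = torch.sigmoid(logits)
    # p_t: probability assigned to the correct class at each pixel
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    focal = alpha_t * (1.0 - p_t) ** gamma * ce
    # Action guidance (assumed form): up-weight pixels of active objects
    return (action_weight * focal).mean()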

Acknowledgements

This work was supported by JST ASPIRE Grant Number JPMJAP2303, JST SPRING Grant Number JPMJSP2108, JSPS KAKENHI Grant Numbers JP22KF0119, JP23H00488, JP24K02956.

Author information

Corresponding author

Correspondence to Yifei Huang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 42531 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ouyang, L., Liu, R., Huang, Y., Furuta, R., Sato, Y. (2025). ActionVOS: Actions as Prompts for Video Object Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15068. Springer, Cham. https://doi.org/10.1007/978-3-031-72684-2_13

  • DOI: https://doi.org/10.1007/978-3-031-72684-2_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72683-5

  • Online ISBN: 978-3-031-72684-2

  • eBook Packages: Computer Science, Computer Science (R0)
