Self-supervised Meta Auxiliary Learning for Actor and Action Video Segmentation from Natural Language

  • Conference paper
  • Artificial Intelligence (CICAI 2023)

Abstract

This paper addresses the problem of actor and action video segmentation from natural language. Given a video and a language query, the goal is to segment the actor and the action described by the query. Existing methods focus on designing elaborate multimodal fusion networks that combine visual and linguistic features into a multimodal representation learned directly from the labeled segmentation task. In this paper, we propose a novel self-supervised meta auxiliary learning method that improves the primary segmentation task by adding an auxiliary task for better generalization. The auxiliary task reconstructs the input sentence representation so that the multimodal representation can be adapted to a specific query, and it requires no additional labels. It can therefore also be used at test time to update the multimodal representation for a specific query in a self-supervised way.
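To make the idea in the abstract concrete: a shared multimodal representation feeds a primary segmentation head and a label-free auxiliary head that reconstructs the sentence representation, and because the auxiliary loss needs no ground-truth mask it can drive a per-query update at test time. The sketch below is illustrative only; the module names (MetaAuxSegmenter, test_time_adapt), feature sizes, reconstruction loss, and single-step SGD update are assumptions made for this example, not the authors' implementation.

```python
# Minimal, hypothetical sketch of the scheme described in the abstract.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetaAuxSegmenter(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=300, hid_dim=256):
        super().__init__()
        # Shared multimodal encoder (placeholder for the real fusion network).
        self.fuse = nn.Sequential(
            nn.Conv2d(vis_dim + txt_dim, hid_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_dim, hid_dim, 3, padding=1), nn.ReLU(),
        )
        # Primary head: per-pixel actor/action mask logits.
        self.seg_head = nn.Conv2d(hid_dim, 1, 1)
        # Auxiliary head: reconstruct the pooled sentence embedding.
        self.recon_head = nn.Linear(hid_dim, txt_dim)

    def encode(self, vis_feat, sent_emb):
        # Tile the sentence embedding over the spatial grid and fuse.
        b, _, h, w = vis_feat.shape
        txt = sent_emb[:, :, None, None].expand(-1, -1, h, w)
        return self.fuse(torch.cat([vis_feat, txt], dim=1))

    def forward(self, vis_feat, sent_emb):
        m = self.encode(vis_feat, sent_emb)
        mask_logits = self.seg_head(m)                    # primary task
        recon = self.recon_head(m.mean(dim=(2, 3)))       # auxiliary task
        return mask_logits, recon


def test_time_adapt(model, vis_feat, sent_emb, lr=1e-4, steps=1):
    """Self-supervised test-time update: minimise the sentence-reconstruction
    loss for this specific query, then run the adapted model.
    No ground-truth mask is needed."""
    adapted = copy.deepcopy(model)           # keep the original weights intact
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        _, recon = adapted(vis_feat, sent_emb)
        loss = F.mse_loss(recon, sent_emb)   # auxiliary loss only
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        mask_logits, _ = adapted(vis_feat, sent_emb)
    return mask_logits


# Example with random tensors standing in for frame features and a
# GloVe-style sentence embedding.
model = MetaAuxSegmenter()
vis = torch.randn(1, 256, 40, 40)
sent = torch.randn(1, 300)
mask = test_time_adapt(model, vis, sent)
print(mask.shape)  # torch.Size([1, 1, 40, 40])
```

In this reading, the adaptation step copies the model and takes a small gradient step on the auxiliary reconstruction loss alone, so each language query gets a representation tuned to it without any segmentation labels at test time.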

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62102289) and in part by the Zhejiang Provincial Natural Science Foundation (Grant No. LQ22F020005).

Author information

Corresponding author

Correspondence to Linwei Ye.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Ye, L., Wang, Z. (2024). Self-supervised Meta Auxiliary Learning for Actor and Action Video Segmentation from Natural Language. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds.) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol. 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_26

  • DOI: https://doi.org/10.1007/978-981-99-8850-1_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8849-5

  • Online ISBN: 978-981-99-8850-1

  • eBook Packages: Computer Science, Computer Science (R0)
