Self-supervised Meta Auxiliary Learning for Actor and Action Video Segmentation from Natural Language

  • Conference paper
  • Artificial Intelligence (CICAI 2023)

Abstract

This paper addresses the problem of actor and action video segmentation from natural language. Given a video and a language query, the goal is to segment the actor and the action described by the query. Existing methods focus on designing elaborate multimodal fusion networks that combine visual and linguistic features into a multimodal representation learned directly from the labeled segmentation task. In this paper, we propose a novel self-supervised meta auxiliary learning method that improves the primary segmentation task by adding an auxiliary task for better generalization. The auxiliary task reconstructs the input sentence representation so that the multimodal representation can be adapted to a specific query, and it requires no additional labels. It can therefore also be used at test time to update the multimodal representation for a specific query in a self-supervised way.
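To make the idea in the abstract concrete: a shared multimodal representation feeds a primary segmentation head and a label-free auxiliary head that reconstructs the sentence representation, and because the auxiliary loss needs no ground-truth mask it can drive a per-query update at test time. The sketch below is illustrative only; the module names (MetaAuxSegmenter, test_time_adapt), feature sizes, reconstruction loss, and single-step SGD update are assumptions made for this example, not the authors' implementation.

```python
# Minimal, hypothetical sketch of the scheme described in the abstract.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetaAuxSegmenter(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=300, hid_dim=256):
        super().__init__()
        # Shared multimodal encoder (placeholder for the real fusion network).
        self.fuse = nn.Sequential(
            nn.Conv2d(vis_dim + txt_dim, hid_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_dim, hid_dim, 3, padding=1), nn.ReLU(),
        )
        # Primary head: per-pixel actor/action mask logits.
        self.seg_head = nn.Conv2d(hid_dim, 1, 1)
        # Auxiliary head: reconstruct the pooled sentence embedding.
        self.recon_head = nn.Linear(hid_dim, txt_dim)

    def encode(self, vis_feat, sent_emb):
        # Tile the sentence embedding over the spatial grid and fuse.
        b, _, h, w = vis_feat.shape
        txt = sent_emb[:, :, None, None].expand(-1, -1, h, w)
        return self.fuse(torch.cat([vis_feat, txt], dim=1))

    def forward(self, vis_feat, sent_emb):
        m = self.encode(vis_feat, sent_emb)
        mask_logits = self.seg_head(m)                    # primary task
        recon = self.recon_head(m.mean(dim=(2, 3)))       # auxiliary task
        return mask_logits, recon


def test_time_adapt(model, vis_feat, sent_emb, lr=1e-4, steps=1):
    """Self-supervised test-time update: minimise the sentence-reconstruction
    loss for this specific query, then run the adapted model.
    No ground-truth mask is needed."""
    adapted = copy.deepcopy(model)           # keep the original weights intact
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        _, recon = adapted(vis_feat, sent_emb)
        loss = F.mse_loss(recon, sent_emb)   # auxiliary loss only
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        mask_logits, _ = adapted(vis_feat, sent_emb)
    return mask_logits


# Example with random tensors standing in for frame features and a
# GloVe-style sentence embedding.
model = MetaAuxSegmenter()
vis = torch.randn(1, 256, 40, 40)
sent = torch.randn(1, 300)
mask = test_time_adapt(model, vis, sent)
print(mask.shape)  # torch.Size([1, 1, 40, 40])
```

In this reading, the adaptation step copies the model and takes a small gradient step on the auxiliary reconstruction loss alone, so each language query gets a representation tuned to it without any segmentation labels at test time.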

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62102289) and in part by the Zhejiang Provincial Natural Science Foundation (Grant No. LQ22F020005).

Author information

Corresponding author

Correspondence to Linwei Ye.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Ye, L., Wang, Z. (2024). Self-supervised Meta Auxiliary Learning for Actor and Action Video Segmentation from Natural Language. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds.) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol. 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_26

  • DOI: https://doi.org/10.1007/978-981-99-8850-1_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8849-5

  • Online ISBN: 978-981-99-8850-1

  • eBook Packages: Computer Science, Computer Science (R0)
