Abstract
We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs, and estimates the object masks referred by the given language expression in the whole video frames. Our algorithm addresses the challenging problem by performing language-based object segmentation and mask propagation jointly using a single deep neural network with a proper combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets including ours and demonstrate the effectiveness of the proposed approach. The dataset is released at https://github.com/skynbe/Refer-Youtube-VOS.
S. Seo—This work was done during an internship at Adobe Research.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
We use Res5 and Res4 feature maps in our model.
- 2.
For each spatial grid (h, w), \(\mathbf {s}_p\) = [\(h_{\text {min}}\), \(h_{\text {avg}}\), \(h_{\text {max}}\), \(w_{\text {min}}\), \(w_{\text {avg}}\), \(w_{\text {max}}\), \(\frac{1}{H}\), \(\frac{1}{W}\)], where \(h_{*}, w_{*} \in [-1, 1]\) are relative coordinates of the grid. H and W denotes the height and width of the whole spatial feature map.
- 3.
\(\mathbf {\widetilde{s}}_{tp} = [t_{\text {min}}, t_{\text {avg}}, t_{\text {max}}, h_{\text {min}}, h_{\text {avg}}, h_{\text {max}}, w_{\text {min}}, w_{\text {avg}}, w_{\text {max}}, \frac{1}{T}, \frac{1}{H}, \frac{1}{W}]\).
References
Benard, A., Gygli, M.: Interactive video object segmentation in the wild. arXiv preprint arXiv:1801.00269 (2017)
Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: CVPR (2017)
Caelles, S., et al.: The 2018 davis challenge on video object segmentation. arXiv preprint arXiv:1803.00557 (2018)
Chen, Y., Pont-Tuset, J., Montes, A., Van Gool, L.: Blazingly fast video object segmentation with pixel-wise metric learning. In: CVPR (2018)
Fan, C., Zhang, X., Zhang, S., Wang, W., Zhang, C., Huang, H.: Heterogeneous memory enhanced multimodal attention model for video question answering. In: CVPR (2019)
Gavrilyuk, K., Ghodrati, A., Li, Z., Snoek, C.G.: Actor and action video segmentation from a sentence. In: CVPR (2018)
Goel, V., Weng, J., Poupart, P.: Unsupervised video object segmentation for deep reinforcement learning. In: NIPS (2018)
Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: ECCV (2016)
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: CVPR (2013)
Khoreva, A., Rohrbach, A., Schiele, B.: Video object segmentation with language referring expressions. In: ACCV (2018)
Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR (2019)
Li, R., et al.: Referring image segmentation via recurrent refinement networks. In: CVPR (2018)
Li, S., Seybold, B., Vorobyov, A., Lei, X., Jay Kuo, C.C.: Unsupervised video object segmentation with motion-based bilateral networks. In: ECCV (2018)
Li, X., Change Loy, C.: Video object segmentation with joint re-identification and attention-aware mask propagation. In: ECCV (2018)
Li, Z., Tao, R., Gavves, E., Snoek, C.G., Smeulders, A.W.: Tracking by natural language specification. In: CVPR (2017)
Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., Yuille, A.: Recurrent multimodal interaction for referring image segmentation. In: ICCV (2017)
Maninis, K.K., Caelles, S., Pont-Tuset, J., Van Gool, L.: Deep extreme cut: From extreme points to object segmentation. In: CVPR (2018)
Margffoy-Tuay, E., Pérez, J.C., Botero, E., Arbeláez, P.: Dynamic multimodal instance segmentation guided by natural language queries. In: ECCV (2018)
Mun, J., Yang, L., Ren, Z., Xu, N., Han, B.: Streamlined dense video captioning. In: CVPR (2019)
Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 792–807. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_48
Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Fast user-guided video object segmentation by interaction-and-propagation networks. In: CVPR (2019)
Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: ICCV (2019)
Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., Lin, D.: Libra r-cnn: towards balanced learning for object detection. In: CVPR (2019)
Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: CVPR (2017)
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)
Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. In: ICCV (2017)
Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: Feelvos: fast end-to-end embedding learning for video object segmentation. In: CVPR (2019)
Wang, H., Deng, C., Yan, J., Tao, D.: Asymmetric cross-guided attention network for actor and action video segmentation from natural language query. In: ICCV (2019)
Wang, W., et al.: Learning unsupervised video object segmentation through visual attention. In: CVPR (2019)
Wug Oh, S., Lee, J.Y., Sunkavalli, K., Joo Kim, S.: Fast video object segmentation by reference-guided mask propagation. In: CVPR (2018)
Xu, C., Hsieh, S.H., Xiong, C., Corso, J.J.: Can humans fly? action understanding with multiple classes of actors. In: CVPR (2015)
Xu, N., et al.: Youtube-vos: sequence-to-sequence video object segmentation. In: ECCV (2018)
Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.K.: Efficient video object segmentation via network modulation. In: CVPR (2018)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: CVPR (2019)
Yu, L., et al.: Mattnet: modular attention network for referring expression comprehension. In: CVPR (2018)
Zhou, T., Wang, S., Zhou, Y., Yao, Y., Li, J., Shao, L.: Motion-attentive transition for zero-shot video object segmentation. In: AAAI (2020)
Acknowledgement
This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [2017-0-01779, 2017-0-01780].
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Seo, S., Lee, JY., Han, B. (2020). URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12360. Springer, Cham. https://doi.org/10.1007/978-3-030-58555-6_13
Download citation
DOI: https://doi.org/10.1007/978-3-030-58555-6_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58554-9
Online ISBN: 978-3-030-58555-6
eBook Packages: Computer ScienceComputer Science (R0)