Abstract
One-Shot Video Object Segmentation (OSVOS) is a CNN architecture for semi-supervised video object segmentation: it separates an object from the background in a frame-independent way with the aid of a single manually segmented frame. In real applications, however, the requirement for a manually segmented frame harms the user-friendliness of a system. To address this problem, we propose referring-expression-based video object segmentation (REVOS), which obtains the first-frame segmentation from a referring expression (a noun phrase that identifies one specific object). Our method uses a language-analysis module to select, from all candidate objects, the target with the highest matching score against the referring expression; it then generates the annotation of the first frame and segments all remaining frames with OSVOS. Experimental results show that our method achieves accuracy similar to OSVOS while being more convenient and flexible for system design.
The first author of this paper is a student. This work was supported by the National Natural Science Foundation of China (Nos. 61373104 and 61405143), the Excellent Science and Technology Enterprise Specialist Project of Tianjin (No. 18JCTPJC59000), and the Tianjin Natural Science Foundation (No. 16JCYBJC42300).
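To make the pipeline concrete, the following is a minimal Python sketch of the two stages the abstract describes: picking the target as the candidate with the highest matching score against the referring expression, then propagating its mask in place of the manual first-frame annotation. The names score_fn, osvos_segment, and the candidate/mask layout are hypothetical placeholders for illustration, not the paper's actual interfaces; in a real system the candidates would come from an object detector and the propagation from an OSVOS implementation.

import numpy as np

def select_target(candidates, expression, score_fn):
    """Return the candidate object with the highest matching score
    against the referring expression. `score_fn` stands in for the
    paper's language-analysis module (hypothetical interface)."""
    scores = [score_fn(cand, expression) for cand in candidates]
    return candidates[int(np.argmax(scores))]

def revos(frames, candidates, expression, score_fn, osvos_segment):
    """Sketch of the REVOS pipeline: ground the referring expression in
    the first frame, use the winning mask in place of the manual
    first-frame annotation, then segment the remaining frames with an
    OSVOS-style routine (`osvos_segment` is a placeholder)."""
    target = select_target(candidates, expression, score_fn)
    masks = [target["mask"]]  # substitutes for the manually segmented frame
    for frame in frames[1:]:
        masks.append(osvos_segment(frame, masks[0]))
    return masks

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    frames = [np.zeros((4, 4)) for _ in range(3)]
    candidates = [{"mask": np.eye(4), "label": "dog"},
                  {"mask": np.ones((4, 4)), "label": "car"}]
    score_fn = lambda cand, expr: float(cand["label"] in expr)  # dummy matcher
    osvos_segment = lambda frame, first_mask: first_mask        # dummy propagation
    print(len(revos(frames, candidates, "the brown dog", score_fn, osvos_segment)))  # 3

The key design point is that the grounding step only replaces the source of the first-frame annotation; everything downstream is unchanged OSVOS, which is why the accuracy stays close to the fully supervised setting.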
References
Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. In: Proceedings of NAACL-HLT (2016)
Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Avinash Ramakanth, S., Venkatesh Babu, R.: SeamSeg: video object segmentation using patch seams. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 376–383 (2014)
Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 221–230 (2017)
Chang, J., Wei, D., Fisher, J.W.: A video representation using temporal superpixels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2051–2058 (2013)
Fan, Q., Zhong, F., Lischinski, D., Cohen-Or, D., Chen, B.: JumpCut: non-successive mask transfer and interpolation for video cutout. ACM Trans. Graph. 34(6), Article No. 195 (2015)
Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video segmentation. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2141–2148. IEEE (2010)
Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 108–124. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_7
Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Lee, Y.J., Kim, J., Grauman, K.: Key-segments for video object segmentation. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1995–2002. IEEE (2011)
Li, F., Kim, T., Humayun, A., Tsai, D., Rehg, J.M.: Video segmentation by tracking many figure-ground segments. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2192–2199 (2013)
Li, J., Mu, L., Zan, H., Zhang, K.: Research on Chinese parsing based on the improved compositional vector grammar. In: Lu, Q., Gao, H. (eds.) Chinese Lexical Semantics. LNCS (LNAI), vol. 9332, pp. 649–658. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27194-1_64
Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., Yuille, A.: Recurrent multimodal interaction for referring image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Liu, J., Wang, L., Yang, M.H.: Referring expression generation and comprehension via attributes. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Märki, N., Perazzi, F., Wang, O., Sorkine-Hornung, A.: Bilateral space video segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 743–751 (2016)
Mitchell, M., van Deemter, K., Reiter, E.: Natural reference to objects in a visual domain. In: Proceedings of the Sixth International Natural Language Generation Conference (INLG) (2010)
Paraboni, I., Galindo, M.R., Iacovelli, D.: Stars2: a corpus of object descriptions in a visual domain. Lang. Resour. Eval. 51(2), 1–24 (2016)
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016)
Perazzi, F., Wang, O., Gross, M., Sorkine-Hornung, A.: Fully connected object proposals for video segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3227–3234 (2015)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49
Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403 (2015)
Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5
Yu, L., Lin, Z., Shen, X., Yang, J., Berg, T.L.: MAttNet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)