One-Shot Video Object Segmentation Initialized with Referring Expression

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2019)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11858)

Abstract

One-Shot Video Object Segmentation (OSVOS) is a CNN architecture for semi-supervised video object segmentation. Given one manually segmented frame, it separates an object from the background in each frame independently. In real applications, however, requiring a manually segmented frame harms the user-friendliness of a system. To address this problem, we propose a video object segmentation method based on referring expressions (named REVOS), which obtains the segmented frame from a referring expression (a noun phrase whose function is to identify one specific object). The main task of our method is to select, from all candidate objects, the target with the highest matching score against the referring expression, using the language analysis module. The method then generates the annotation of the first frame and segments all the remaining frames with OSVOS. Experimental results show that our method achieves accuracy similar to OSVOS while being more convenient and flexible for system design.
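The target-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` structure, the `select_target` helper, and the `word_overlap_score` function are hypothetical names introduced here, and the word-overlap score is only a toy stand-in for the language analysis module. Generating the first-frame annotation and running OSVOS on the remaining frames are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """A candidate object in the first frame (hypothetical structure)."""
    label: str                       # textual description of the candidate
    box: Tuple[int, int, int, int]   # bounding box as (x1, y1, x2, y2)


def word_overlap_score(candidate: Candidate, expression: str) -> float:
    """Toy matching score: fraction of expression words found in the label.

    Assumption: this merely stands in for a learned language analysis module.
    """
    words = expression.lower().split()
    if not words:
        return 0.0
    label_words = set(candidate.label.lower().split())
    return sum(w in label_words for w in words) / len(words)


def select_target(
    candidates: Sequence[Candidate],
    expression: str,
    score_fn: Callable[[Candidate, str], float] = word_overlap_score,
) -> Candidate:
    """Return the candidate with the highest matching score."""
    if not candidates:
        raise ValueError("no candidate objects in the first frame")
    return max(candidates, key=lambda c: score_fn(c, expression))


candidates = [
    Candidate("black dog", (10, 20, 110, 160)),
    Candidate("white cat", (150, 40, 230, 140)),
]
target = select_target(candidates, "the white cat")
print(target.label, target.box)
```

Passing the scoring function as a parameter keeps the selection logic independent of whichever matching model is actually used; the selected candidate's region would then serve as the first-frame annotation.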

The first author of this paper is a student. This work was supported by the National Natural Science Foundation of China (No. 61373104, No. 61405143), the Excellent Science and Technology Enterprise Specialist Project of Tianjin (No. 18JCTPJC59000), and the Tianjin Natural Science Foundation (No. 16JCYBJC42300).

Corresponding author

Correspondence to Guanghao Jin.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Bu, X., Wang, J., Liang, J., Liu, K., Sun, Y., Jin, G. (2019). One-Shot Video Object Segmentation Initialized with Referring Expression. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2019. Lecture Notes in Computer Science, vol 11858. Springer, Cham. https://doi.org/10.1007/978-3-030-31723-2_35

  • DOI: https://doi.org/10.1007/978-3-030-31723-2_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31722-5

  • Online ISBN: 978-3-030-31723-2

  • eBook Packages: Computer Science, Computer Science (R0)
