Abstract
One-Shot Video Object Segmentation (OSVOS) is a CNN architecture for semi-supervised video object segmentation: it separates an object from the background in a frame-independent way with the aid of a single manually segmented frame. In real applications, however, the requirement for a manually segmented frame harms the user-friendliness of a system. To address this problem, we propose referring-expression-based video object segmentation (REVOS), which obtains the first-frame segmentation from a referring expression (a noun phrase that identifies one specific object). Our method uses a language-analysis module to select, from all candidate objects, the target with the highest matching score against the referring expression; it then generates the annotation of the first frame and segments all remaining frames with OSVOS. Experimental results show that our method achieves accuracy similar to OSVOS while being more convenient and flexible for system design.
The first author of this paper is a student. This work was supported by the National Natural Science Foundation of China (Nos. 61373104 and 61405143), the Excellent Science and Technology Enterprise Specialist Project of Tianjin (No. 18JCTPJC59000), and the Tianjin Natural Science Foundation (No. 16JCYBJC42300).
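To make the pipeline concrete, the following is a minimal Python sketch of the two stages the abstract describes: picking the target as the candidate with the highest matching score against the referring expression, then propagating its mask in place of the manual first-frame annotation. The names score_fn, osvos_segment, and the candidate/mask layout are hypothetical placeholders for illustration, not the paper's actual interfaces; in a real system the candidates would come from an object detector and the propagation from an OSVOS implementation.

import numpy as np

def select_target(candidates, expression, score_fn):
    """Return the candidate object with the highest matching score
    against the referring expression. `score_fn` stands in for the
    paper's language-analysis module (hypothetical interface)."""
    scores = [score_fn(cand, expression) for cand in candidates]
    return candidates[int(np.argmax(scores))]

def revos(frames, candidates, expression, score_fn, osvos_segment):
    """Sketch of the REVOS pipeline: ground the referring expression in
    the first frame, use the winning mask in place of the manual
    first-frame annotation, then segment the remaining frames with an
    OSVOS-style routine (`osvos_segment` is a placeholder)."""
    target = select_target(candidates, expression, score_fn)
    masks = [target["mask"]]  # substitutes for the manually segmented frame
    for frame in frames[1:]:
        masks.append(osvos_segment(frame, masks[0]))
    return masks

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    frames = [np.zeros((4, 4)) for _ in range(3)]
    candidates = [{"mask": np.eye(4), "label": "dog"},
                  {"mask": np.ones((4, 4)), "label": "car"}]
    score_fn = lambda cand, expr: float(cand["label"] in expr)  # dummy matcher
    osvos_segment = lambda frame, first_mask: first_mask        # dummy propagation
    print(len(revos(frames, candidates, "the brown dog", score_fn, osvos_segment)))  # 3

The key design point is that the grounding step only replaces the source of the first-frame annotation; everything downstream is unchanged OSVOS, which is why the accuracy stays close to the fully supervised setting.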
References
Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. In: Proceedings of NAACL-HLT (2016)
Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Avinash Ramakanth, S., Venkatesh Babu, R.: SeamSeg: video object segmentation using patch seams. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 376–383 (2014)
Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 221–230 (2017)
Chang, J., Wei, D., Fisher, J.W.: A video representation using temporal superpixels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2051–2058 (2013)
Fan, Q., Zhong, F., Lischinski, D., Cohen-Or, D., Chen, B.: JumpCut: non-successive mask transfer and interpolation for video cutout. ACM Trans. Graph. 34(6), Article No. 195 (2015)
Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video segmentation. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2141–2148. IEEE (2010)
Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 108–124. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_7
Chen, K., Kovvuri, R., Nevatia, R.: Query-guided regression network with context policy for phrase grounding. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Lee, Y.J., Kim, J., Grauman, K.: Key-segments for video object segmentation. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1995–2002. IEEE (2011)
Li, F., Kim, T., Humayun, A., Tsai, D., Rehg, J.M.: Video segmentation by tracking many figure-ground segments. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2192–2199 (2013)
Li, J., Mu, L., Zan, H., Zhang, K.: Research on Chinese parsing based on the improved compositional vector grammar. In: Lu, Q., Gao, H. (eds.) Chinese Lexical Semantics. LNCS (LNAI), vol. 9332, pp. 649–658. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27194-1_64
Liu, C., Lin, Z., Shen, X., Yang, J., Lu, X., Yuille, A.: Recurrent multimodal interaction for referring image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Liu, J., Wang, L., Yang, M.H.: Referring expression generation and comprehension via attributes. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Märki, N., Perazzi, F., Wang, O., Sorkine-Hornung, A.: Bilateral space video segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 743–751 (2016)
Mitchell, M., van Deemter, K., Reiter, E.: Natural reference to objects in a visual domain. In: Proceedings of the Sixth International Natural Language Generation Conference (INLG) (2010)
Paraboni, I., Galindo, M.R., Iacovelli, D.: Stars2: a corpus of object descriptions in a visual domain. Lang. Resour. Eval. 51(2), 1–24 (2016)
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016)
Perazzi, F., Wang, O., Gross, M., Sorkine-Hornung, A.: Fully connected object proposals for video segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3227–3234 (2015)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., Schiele, B.: Grounding of textual phrases in images by reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 817–834. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_49
Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving image-text embeddings. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403 (2015)
Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 69–85. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5
Yu, L., Lin, Z., Shen, X., Yang, J., Berg, T.L.: MAttNet: modular attention network for referring expression comprehension. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)