Video Object Segmentation Based on Guided Feature Transfer Learning

Fiaz, Mustansar; Mahmood, Arif; Shahzad Farooq, Sehar; Ali, Kamran; Shaheryar, Muhammad; Jung, Soon Ki

doi:10.1007/978-3-031-06381-7_14

Mustansar Fiaz ORCID: orcid.org/0000-0003-2289-2284⁸,
Arif Mahmood⁹,
Sehar Shahzad Farooq¹⁰,
Kamran Ali¹¹,
Muhammad Shaheryar¹⁰ &
Soon Ki Jung ORCID: orcid.org/0000-0003-0239-6785¹⁰

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1578)

Included in the following conference series:

International Workshop on Frontiers of Computer Vision


Abstract

Video Object Segmentation (VOS) is a fundamental task with many real-world computer vision applications, and it is challenging because of distractors and background clutter. Many existing online learning approaches have limited practical significance because of the high computational cost of fine-tuning network parameters. Matching-based and propagation-based approaches are computationally efficient but may suffer from degraded performance under cluttered backgrounds and object drift. To handle these issues, we propose an offline end-to-end model that learns guided feature transfer for VOS. We introduce guided feature modulation based on the target mask to capture video context information, and we use a generative appearance model to provide cues for both the target and the background. The proposed guided feature modulation learns the target semantic information based on modulation activations. The generative appearance model learns the probability of a pixel being target or background. In addition, low-resolution features from deeper networks may not capture global contextual information and may reduce performance during feature refinement. Therefore, we also propose a guided pooled decoder that learns global as well as local context information for better feature refinement. Evaluations on two VOS benchmark datasets, DAVIS2016 and DAVIS2017, show excellent performance of the proposed framework compared to more than 20 existing state-of-the-art methods.
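The two ideas named in the abstract, mask-guided feature modulation and a generative appearance model, can be illustrated with a small toy sketch. This is not the authors' network: the sigmoid gate, the isotropic Gaussians with a shared variance `var`, the equal class priors, and the function names `masked_average`, `modulate`, and `target_posterior` are all assumptions made for illustration. Plain Python lists stand in for feature maps (one list of channel values per pixel) to keep the example dependency-free.

```python
import math


def masked_average(features, mask):
    """Channel-wise mean of pixel features, weighted by a soft mask in [0, 1]."""
    total = sum(mask)
    channels = len(features[0])
    if total == 0:
        return [0.0] * channels
    return [
        sum(m * f[c] for f, m in zip(features, mask)) / total
        for c in range(channels)
    ]


def modulate(features, mask):
    """Scale each channel by a sigmoid gate of the target's mean activation.

    The gate form is an assumption; it only illustrates channel-wise
    modulation driven by the target mask.
    """
    gate = [1.0 / (1.0 + math.exp(-v)) for v in masked_average(features, mask)]
    return [[g * x for g, x in zip(gate, f)] for f in features]


def target_posterior(features, mask, var=1.0):
    """Per-pixel probability of being target under two isotropic Gaussians.

    One Gaussian is fitted to masked (target) pixels and one to the rest
    (background); equal priors and a shared variance are assumed.
    """
    mu_t = masked_average(features, mask)
    mu_b = masked_average(features, [1.0 - m for m in mask])

    def log_lik(f, mu):
        return -sum((x - u) ** 2 for x, u in zip(f, mu)) / (2.0 * var)

    posterior = []
    for f in features:
        lt, lb = log_lik(f, mu_t), log_lik(f, mu_b)
        top = max(lt, lb)  # subtract the max for numerical stability
        et, eb = math.exp(lt - top), math.exp(lb - top)
        posterior.append(et / (et + eb))
    return posterior


# Four pixels with two channels each; the first two pixels are the target.
features = [[2.0, 0.0], [2.0, 0.2], [0.0, 1.0], [0.2, 1.0]]
mask = [1.0, 1.0, 0.0, 0.0]

modulated = modulate(features, mask)
posterior = target_posterior(features, mask)
```

In this toy setting the posterior is high for pixels whose features resemble the masked region and low for the others, which is the kind of target-versus-background cue the abstract attributes to the generative appearance model.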



Acknowledgment

This study was supported by the BK21 FOUR project (AI-driven Convergence Software Education Research Program) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (4199990214394).

Author information

Authors and Affiliations

Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Mustansar Fiaz
Department of Computer Science, Information Technology University, Lahore, Pakistan
Arif Mahmood
School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea
Sehar Shahzad Farooq, Muhammad Shaheryar & Soon Ki Jung
Department of Computer Science, University of Central Florida, Orlando, USA
Kamran Ali


Corresponding author

Correspondence to Soon Ki Jung.

Editor information

Editors and Affiliations

Aoyama Gakuin University, Kanagawa, Japan
Kazuhiko Sumi
Chosun University, Gwangju, Korea (Republic of)
In Seop Na
Aoyama Gakuin University, Kanagawa, Japan
Naoshi Kaneko


About this paper

Cite this paper

Fiaz, M., Mahmood, A., Shahzad Farooq, S., Ali, K., Shaheryar, M., Jung, S.K. (2022). Video Object Segmentation Based on Guided Feature Transfer Learning. In: Sumi, K., Na, I.S., Kaneko, N. (eds) Frontiers of Computer Vision. IW-FCV 2022. Communications in Computer and Information Science, vol 1578. Springer, Cham. https://doi.org/10.1007/978-3-031-06381-7_14


DOI: https://doi.org/10.1007/978-3-031-06381-7_14
Published: 17 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06380-0
Online ISBN: 978-3-031-06381-7
eBook Packages: Computer Science, Computer Science (R0)
