Abstract
Video Object Segmentation (VOS) is a fundamental task with many real-world computer vision applications, and it is challenging because of distractors and background clutter. Many existing online learning approaches have limited practical significance because of the high computational cost required to fine-tune network parameters. Matching-based and propagation-based approaches are computationally efficient but may suffer from degraded performance under cluttered backgrounds and object drift. To handle these issues, we propose an offline end-to-end model that learns guided feature transfer for VOS. We introduce guided feature modulation based on the target mask to capture video context information, and we use a generative appearance model to provide cues for both the target and the background. The proposed guided feature modulation learns target semantic information through modulation activations. The generative appearance model learns the probability that a pixel belongs to the target or to the background. In addition, low-resolution features from deeper networks may not capture global contextual information, which may reduce performance during feature refinement. We therefore also propose a guided pooled decoder that learns global as well as local context information for better feature refinement. Evaluation on two VOS benchmark datasets, DAVIS2016 and DAVIS2017, shows excellent performance of the proposed framework compared to more than 20 existing state-of-the-art methods.
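The abstract does not spell out how the mask-guided modulation is computed. As a rough illustration only, the sketch below assumes one common form of feature modulation: a target descriptor is pooled from the features under the mask and turned into a per-channel gate that rescales the feature map. The function names, the sigmoid gate, and the `weight` and `bias` parameters are hypothetical stand-ins for learned components and are not taken from the paper.

```python
import numpy as np


def masked_average_pool(features, mask, eps=1e-6):
    """Average each channel of `features` (C, H, W) over pixels where `mask` (H, W) is on."""
    weights = mask.astype(features.dtype)
    total = (features * weights[None, :, :]).sum(axis=(1, 2))
    return total / (weights.sum() + eps)


def modulate(features, mask, weight, bias):
    """Rescale each channel by a gate derived from the mask-pooled target descriptor.

    `weight` (C, C) and `bias` (C,) are hypothetical placeholders for learned parameters.
    """
    descriptor = masked_average_pool(features, mask)             # shape (C,)
    gate = 1.0 / (1.0 + np.exp(-(weight @ descriptor + bias)))   # sigmoid, values in (0, 1)
    return features * gate[:, None, None]                        # broadcast over H and W


# Toy example: 4 channels on an 8x8 grid with a 3x3 target region.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))
mask = np.zeros((8, 8))
mask[2:5, 3:6] = 1
out = modulate(feats, mask, np.eye(4), np.zeros(4))
```

Because the gate lies strictly between 0 and 1, this variant can only attenuate channels; a trained model would likely also learn an additive shift, which is omitted here for brevity.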
Acknowledgment
This study was supported by the BK21 FOUR project (AI-driven Convergence Software Education Research Program) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (4199990214394).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fiaz, M., Mahmood, A., Shahzad Farooq, S., Ali, K., Shaheryar, M., Jung, S.K. (2022). Video Object Segmentation Based on Guided Feature Transfer Learning. In: Sumi, K., Na, I.S., Kaneko, N. (eds) Frontiers of Computer Vision. IW-FCV 2022. Communications in Computer and Information Science, vol 1578. Springer, Cham. https://doi.org/10.1007/978-3-031-06381-7_14
DOI: https://doi.org/10.1007/978-3-031-06381-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06380-0
Online ISBN: 978-3-031-06381-7
eBook Packages: Computer Science (R0)