Abstract
Video semantic segmentation (VSS) plays a crucial role in many real-world applications, such as unmanned vehicles, autonomous robots, and augmented reality. Despite significant progress in this field, balancing accuracy and efficiency remains a major challenge. This paper presents a novel dual-branch hybrid network of CNN and Transformer with adaptive keyframe scheduling (DHN-AKS) that achieves higher accuracy and faster inference for VSS. One branch, \(Net^T\), uses a hierarchical Transformer to extract high-level features from keyframes, exploiting the Transformer's strength in modeling global semantic information to benefit segmentation accuracy. The other branch, \(Net^C\), uses a lightweight convolutional network (ResNet-18) to extract low-level features from non-keyframes, benefiting segmentation efficiency. Moreover, we present a dynamically updated memory matrix that stores the salient semantic information of historical video frames, allowing the temporal relevance of the current frame to be explored via cross attention. Experiments on two benchmark datasets, Cityscapes and CamVid, demonstrate that the proposed framework achieves competitive accuracy and inference time against previous state-of-the-art methods.
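The two mechanisms the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold-based keyframe test, the memory shapes, and the exponential-moving-average update rule are all illustrative assumptions; only the general pattern (schedule keyframes adaptively, then cross-attend current-frame features over a memory matrix of historical semantics) follows the text.

```python
# Illustrative sketch of adaptive keyframe scheduling and memory-based
# cross attention; names, shapes, and thresholds are assumptions.
import numpy as np


def is_keyframe(prev_feat: np.ndarray, cur_feat: np.ndarray,
                threshold: float = 0.3) -> bool:
    """Schedule a keyframe when features change enough between frames."""
    diff = np.abs(cur_feat - prev_feat).mean()
    scale = np.abs(prev_feat).mean() + 1e-8
    return (diff / scale) > threshold


def cross_attend(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Attend current-frame tokens (L, d) over memory rows (N, d)."""
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)        # (L, N) similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ memory                          # (L, d) fused features


def update_memory(memory: np.ndarray, keyframe_feat: np.ndarray,
                  momentum: float = 0.9) -> np.ndarray:
    """Dynamically update the memory matrix from new keyframe features."""
    return momentum * memory + (1.0 - momentum) * keyframe_feat


rng = np.random.default_rng(0)
memory = rng.standard_normal((8, 16))   # N=8 memory slots, d=16 channels
tokens = rng.standard_normal((4, 16))   # L=4 current-frame tokens
fused = cross_attend(tokens, memory)    # temporally enriched features
memory = update_memory(memory, rng.standard_normal((8, 16)))
```

In this sketch a frame classified as a keyframe would be routed through the Transformer branch and used to refresh the memory, while non-keyframes would go through the lightweight CNN branch and only read from the memory via cross attention.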
Data availability
The data used to support the findings of this study are available from the corresponding author upon request.
Author information
Contributions
ZL wrote the main manuscript text, BZ performed the data analysis, and WD provided the methodology.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by J. Gao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liang, Z., Dong, W. & Zhang, B. A dual-branch hybrid network of CNN and transformer with adaptive keyframe scheduling for video semantic segmentation. Multimedia Systems 30, 67 (2024). https://doi.org/10.1007/s00530-024-01262-7