Lightweight video salient object detection via channel-shuffle enhanced multi-modal fusion network

Huang, Kan; Xu, Zhijing

doi:10.1007/s11042-023-15251-x

Published: 30 May 2023

Volume 83, pages 1025–1039, (2024)

Multimedia Tools and Applications

Abstract

Video salient object detection (VSOD) has advanced considerably with the application of deep neural networks. However, the high computational cost of these networks has hindered the deployment of VSOD models in real-world applications. In this work, we focus on developing a lightweight VSOD model. The main issues in designing lightweight video saliency models are how to combine multi-modal information (i.e., spatial and temporal information) and how to model multi-scale spatial context, both in an efficient setting. To tackle these issues, we propose a lightweight neural network architecture for VSOD. We start by adopting the ImageNet-pretrained ShuffleNet-V2 for deep feature extraction. On top of this backbone, a Depth-wise Multi-scale Pooling Module (DMPM) is proposed to aggregate multi-scale spatial context information while adding only a small number of parameters and little computational overhead. Most importantly, a Shuffle enhanced Multi-modal Fusion Module (SMFM) is proposed to fuse spatial and temporal information progressively and efficiently, deriving the final saliency prediction. With these modules, our method achieves detection accuracy competitive with current leading methods while keeping a much smaller model size. Specifically, the proposed model runs at 49.2 FPS on a GPU and holds only 1.9M parameters, making it suitable for real-time applications.
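The abstract does not spell out the internals of the SMFM, so the following is only a minimal sketch of the generic channel-shuffle operation known from the ShuffleNet family, not the authors' implementation. It assumes features are held as a plain list of channels; the function names and the channel labels (s0..s2 for spatial, t0..t2 for temporal) are hypothetical and chosen purely for illustration.

```python
def channel_shuffle_order(num_channels, groups):
    """Return the channel order produced by a ShuffleNet-style shuffle.

    Conceptually: view the channel indices as a (groups, per_group) grid,
    transpose it to (per_group, groups), and flatten it again.
    """
    if groups <= 0 or num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    per_group = num_channels // groups
    return [g * per_group + i for i in range(per_group) for g in range(groups)]


def channel_shuffle(channels, groups):
    """Reorder a list of channels so that the groups become interleaved."""
    order = channel_shuffle_order(len(channels), groups)
    return [channels[c] for c in order]


# Hypothetical example: two concatenated groups of three channels each,
# labelled as spatial (s*) and temporal (t*) features for illustration only.
concatenated = ["s0", "s1", "s2", "t0", "t1", "t2"]
shuffled = channel_shuffle(concatenated, groups=2)
print(shuffled)
```

After the shuffle, every pair of neighbouring channels contains one entry from each group, so a subsequent grouped or depth-wise operation can mix information from both groups without the cost of a dense convolution over all channels. The shuffle itself is a fixed permutation and adds no parameters.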

Data Availability

Data sharing is not applicable to this article.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 62101316.

Author information

Authors and Affiliations

College of Information Engineering, Shanghai Maritime University, Shanghai, China
Kan Huang & Zhijing Xu

Authors

Kan Huang
Zhijing Xu

Corresponding author

Correspondence to Kan Huang.

Ethics declarations

Compliance with ethical standards

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huang, K., Xu, Z. Lightweight video salient object detection via channel-shuffle enhanced multi-modal fusion network. Multimed Tools Appl 83, 1025–1039 (2024). https://doi.org/10.1007/s11042-023-15251-x

Received: 30 July 2021
Revised: 22 November 2022
Accepted: 06 April 2023
Published: 30 May 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s11042-023-15251-x
