DOI: 10.1145/3532342.3532352
research-article

Hierarchical Vision Transformer with Channel Attention for RGB-D Image Segmentation

Published: 29 June 2022

Abstract

Although convolutional neural networks (CNNs) have become the mainstream approach to image processing and have achieved great success over the past decade, the local nature of convolution makes it difficult for them to capture global, long-range semantic information. Moreover, in some scenes a model based on RGB images alone struggles to classify pixels accurately and to segment object edges finely. This study presents a hierarchical vision Transformer model, named Swin-RGB-D, that exploits the depth information in depth images to supplement and enhance ambiguous or obscure features in RGB images. In this design, RGB and depth images are the two inputs of a two-branch network. The upstream branch applies the Swin Transformer, which is capable of learning global continuous information from RGB images for segmentation; the other branch performs channel attention on the depth image to capture the correlation and dependency between feature channels and generates a weight matrix. Matrix multiplication is then performed on the feature maps at each stage of the down-sampling process for weighted multi-modal feature extraction. The fused maps are then added to the up-sampled feature maps of the corresponding size, which compensates for the feature distortion introduced during sampling. Experimental results on two benchmark datasets show that the proposed model makes the network more sensitive to edge information.
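To make the data flow in the abstract concrete, the following is a minimal sketch of one fusion stage, not the authors' implementation. It assumes a squeeze-and-excitation-style channel attention (global average pooling, two fully connected layers, a sigmoid) and applies the resulting weights to the RGB feature maps by per-channel scaling, which is a simplification of the weight-matrix multiplication the abstract describes. The function names, the toy weights `w1` and `w2`, and the nested-list tensors are all illustrative assumptions.

```python
import math

def channel_weights(depth_feat, w1, w2):
    """Hypothetical SE-style channel attention over a depth feature map.

    depth_feat: list of C channels, each an H x W list of floats.
    w1: hidden x C weights, w2: C x hidden weights (toy, untrained).
    Returns one weight in (0, 1) per channel.
    """
    # Squeeze: global average pooling over each channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in depth_feat]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def weight_rgb(rgb_feat, weights):
    """Scale each RGB feature channel by its depth-derived weight."""
    return [[[w * v for v in row] for row in ch] for ch, w in zip(rgb_feat, weights)]

def add_skip(upsampled, fused):
    """Element-wise sum of up-sampled decoder features and fused encoder features."""
    return [[[u + f for u, f in zip(u_row, f_row)]
             for u_row, f_row in zip(u_ch, f_ch)]
            for u_ch, f_ch in zip(upsampled, fused)]

# Toy example: 2 channels of 2 x 2 features (all values are made up).
depth = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
rgb = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]]
up = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
w1 = [[1.0, 0.0]]        # 1 hidden unit, 2 input channels
w2 = [[1.0], [-1.0]]     # 2 output channels, 1 hidden unit

weights = channel_weights(depth, w1, w2)   # one weight per channel
fused = weight_rgb(rgb, weights)           # depth-weighted RGB features
out = add_skip(up, fused)                  # skip connection into the decoder
print(weights)
```

In this toy run the first depth channel is active and the second is empty, so the first RGB channel receives the larger weight. In a real model the tensors would come from the Swin Transformer stages and the attention weights would be learned.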

Cited By

  • (2025) Deep learning based 3D segmentation in computer vision: A survey. Information Fusion 115 (102722). DOI: 10.1016/j.inffus.2024.102722. Online publication date: Mar-2025
  • (2024) Semantic Guidance Fusion Network for Cross-Modal Semantic Segmentation. Sensors 24:8 (2473). DOI: 10.3390/s24082473. Online publication date: 12-Apr-2024
  • (2024) MalViT: An Approach to Enhancing Malware Detection. 2024 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) (1-8). DOI: 10.1109/ACCAI61061.2024.10601747. Online publication date: 9-May-2024

Published In

ACM Other conferences
SSPS '22: Proceedings of the 4th International Symposium on Signal Processing Systems
March 2022
116 pages
ISBN:9781450396103
DOI:10.1145/3532342
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Channel attention
  2. Depth images
  3. Multi-modal
  4. Segmentation
  5. Swin Transformer

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • NSFC
  • the Sichuan Science and Technology Programs

Conference

SSPS 2022

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 34
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 07 Mar 2025

Citations

Cited By

  • (2024) An Improved PointNet++ Based Method for 3D Point Cloud Geometric Features Segmentation in Mechanical Parts. Procedia CIRP 129 (25-30). DOI: 10.1016/j.procir.2024.10.006. Online publication date: 2024
  • (2024) Cascading context enhancement network for RGB-D semantic segmentation. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19110-1. Online publication date: 15-Apr-2024
  • (2024) CLGFormer: Cross-Level-Guided transformer for RGB-D semantic segmentation. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19051-9. Online publication date: 9-May-2024
  • (2024) A review of deep learning-based stereo vision techniques for phenotype feature and behavioral analysis of fish in aquaculture. Artificial Intelligence Review 58:1. DOI: 10.1007/s10462-024-10960-7. Online publication date: 7-Nov-2024
  • (2023) Automatic Network Architecture Search for RGB-D Semantic Segmentation. Proceedings of the 31st ACM International Conference on Multimedia (3777-3786). DOI: 10.1145/3581783.3612288. Online publication date: 26-Oct-2023
  • (2023) Industrial Image Anomaly Detection Method Based on Improved MAE. 2023 28th International Conference on Automation and Computing (ICAC) (1-6). DOI: 10.1109/ICAC57885.2023.10275293. Online publication date: 30-Aug-2023
