Hierarchical Vision Transformer with Channel Attention for RGB-D Image Segmentation

Authors: Chaolong Zhang, Jian Huang

SSPS '22: Proceedings of the 4th International Symposium on Signal Processing Systems

Pages 68 - 73

https://doi.org/10.1145/3532342.3532352

Published: 29 June 2022

Abstract

Although convolutional neural networks (CNNs) have become the mainstream approach to image processing and have achieved great success over the past decade, the local nature of convolution makes it difficult for them to capture global, long-range semantic information. Moreover, in some scenes a model based purely on RGB images struggles to classify pixels accurately and to segment object edges finely. This study presents a hierarchical vision Transformer model, named Swin-RGB-D, that exploits the depth information in depth images to supplement and enhance ambiguous and obscure features in RGB images. In this design, the RGB and depth images are the two inputs of a two-branch network. The upstream branch applies the Swin Transformer, which is capable of learning global, continuous information from RGB images for segmentation; the other branch performs channel attention on the depth image to capture the correlations and dependencies between feature channels and generates a weight matrix. The feature maps at each stage of the down-sampling process are then multiplied by this weight matrix for weighted multi-modal feature extraction. The fused maps are then added to the up-sampled feature maps of the corresponding size, which compensates for the feature distortion introduced by the sampling process. Experimental results on two benchmark datasets show that the proposed model makes the network more sensitive to edge information.
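The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a squeeze-and-excitation style channel attention (global average pooling followed by two small fully connected layers), it treats the "matrix multiplication" as a per-channel scaling of the RGB feature map, and every function name and parameter below (`channel_attention`, `fuse`, `add_maps`, `w1`, `b1`, `w2`, `b2`) is hypothetical. Plain Python lists are used only to keep the example dependency-free.

```python
import math


def global_avg_pool(fmap):
    """Squeeze: reduce each H x W channel of a C x H x W map to its mean."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def dense(vec, weights, bias):
    """Fully connected layer; `weights` has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(depth_fmap, w1, b1, w2, b2):
    """Assumed SE-style attention: one weight in (0, 1) per depth channel."""
    squeezed = global_avg_pool(depth_fmap)
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]  # ReLU
    return [sigmoid(z) for z in dense(hidden, w2, b2)]


def fuse(rgb_fmap, channel_weights):
    """Scale every RGB channel by the weight derived from the depth branch."""
    return [[[channel_weights[c] * px for px in row] for row in ch]
            for c, ch in enumerate(rgb_fmap)]


def add_maps(a, b):
    """Element-wise sum of two same-sized maps (the skip-style addition)."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


if __name__ == "__main__":
    # Toy 2-channel, 2 x 2 feature maps standing in for one encoder stage.
    rgb = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
    depth = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
    # Hypothetical layer parameters: 2 channels -> 1 hidden unit -> 2 channels.
    w1, b1 = [[1.0, 1.0]], [0.0]
    w2, b2 = [[0.5], [-0.5]], [0.0, 0.0]
    weights = channel_attention(depth, w1, b1, w2, b2)
    fused = fuse(rgb, weights)
    # In a decoder, `fused` would be added to an up-sampled map of equal size.
    print(weights)
    print(add_maps(fused, rgb))
```

In a full model these operations would run once per down-sampling stage, with learned parameters and tensor operations from a deep learning framework in place of nested lists.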


Published In

SSPS '22: Proceedings of the 4th International Symposium on Signal Processing Systems

March 2022

116 pages

ISBN:9781450396103

DOI:10.1145/3532342

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NSFC
the Sichuan Science and Technology Programs

Conference

SSPS 2022

SSPS 2022: 2022 4th International Symposium on Signal Processing Systems

March 25 - 27, 2022

Xi'an, China
