Abstract:
RGB-D salient object detection (SOD) is a pixel-level dense prediction task that highlights the prominent objects in a scene. Recently, Convolutional Neural Networks (CNNs) have been widely applied to SOD to generate multi-level features that complement each other. However, most methods ignore the distinct characteristics of multi-level features (high-level and low-level features). To exploit multi-level features effectively, we propose a novel multi-modality hierarchy-aware decision network (HDNet) that embeds a Swin Transformer as the encoder. The proposed HDNet contains three primary designs: (1) a Swin Transformer encoder is employed instead of a CNN to learn long-range dependencies; (2) a hierarchy-aware feature decision mechanism (HFDM) is proposed to exploit the local detail cues of low-level features and the global semantic information of high-level features; it consists of two sub-modules, a low-hierarchy edge module (LEM) and a high-hierarchy region module (HRM); (3) a decision-based fusion module (DFM) is designed to fuse RGB and depth features according to the attributes of the multi-level features generated by the HFDM. Experiments on five public benchmarks verify that our framework outperforms 18 state-of-the-art algorithms.
Published in: IEEE Signal Processing Letters (Volume: 29)
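
To make the pipeline described in the abstract concrete, the following is a minimal Python/PyTorch sketch. Only the module names (LEM, HRM, DFM, HDNet) come from the abstract; their internals, the ToyHDNet class, and the PlaceholderEncoder (a plain convolutional stack standing in for the Swin Transformer) are illustrative assumptions, not the authors' implementation.

Sketch (Python/PyTorch):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PlaceholderEncoder(nn.Module):
    # Stand-in for the Swin Transformer encoder: returns four feature levels,
    # each at half the spatial resolution of the previous one (assumption).
    def __init__(self, in_ch, dims=(96, 192, 384, 768)):
        super().__init__()
        chans = (in_ch,) + tuple(dims)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(len(dims))
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # feats[0] = low-level, feats[-1] = high-level

class LEM(nn.Module):
    # Low-hierarchy edge module (assumed form): refine local detail/edge cues.
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, f_low):
        return f_low + F.relu(self.conv(f_low))

class HRM(nn.Module):
    # High-hierarchy region module (assumed form): re-weight channels with a
    # global-context vector to emphasise semantic regions.
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Conv2d(ch, ch, 1)

    def forward(self, f_high):
        context = F.adaptive_avg_pool2d(f_high, 1)
        return f_high * torch.sigmoid(self.fc(context))

class DFM(nn.Module):
    # Decision-based fusion module (assumed form): gated fusion of RGB and
    # depth features at one hierarchy level.
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_rgb, f_depth):
        g = torch.sigmoid(self.gate(torch.cat([f_rgb, f_depth], dim=1)))
        return g * f_rgb + (1 - g) * f_depth

class ToyHDNet(nn.Module):
    # Toy end-to-end pipeline: encode RGB and depth, apply LEM to the lowest
    # level and HRM to the highest level, fuse the two modalities with DFM,
    # and predict a single-channel saliency map.
    def __init__(self, dims=(96, 192, 384, 768)):
        super().__init__()
        self.enc_rgb = PlaceholderEncoder(3, dims)
        self.enc_d = PlaceholderEncoder(1, dims)
        self.lem, self.hrm = LEM(dims[0]), HRM(dims[-1])
        self.dfm_low, self.dfm_high = DFM(dims[0]), DFM(dims[-1])
        self.head = nn.Conv2d(dims[0] + dims[-1], 1, 1)

    def forward(self, rgb, depth):
        r, d = self.enc_rgb(rgb), self.enc_d(depth)
        low = self.dfm_low(self.lem(r[0]), self.lem(d[0]))
        high = self.dfm_high(self.hrm(r[-1]), self.hrm(d[-1]))
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear', align_corners=False)
        sal = self.head(torch.cat([low, high], dim=1))
        return F.interpolate(sal, size=rgb.shape[-2:], mode='bilinear', align_corners=False)

rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 1, 224, 224)
print(ToyHDNet()(rgb, depth).shape)  # torch.Size([1, 1, 224, 224])

The sketch only shows how the named components could connect: the real HDNet uses all encoder levels, a genuine Swin Transformer backbone, and the HFDM's decision logic, none of which is reproduced here.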