Full length article
SRI-Net: Similarity retrieval-based inference network for light field salient object detection

https://doi.org/10.1016/j.jvcir.2022.103721

Highlights

  • We deploy the FSRM to explicitly mine complementary information from all focal slices.

  • Similarity computation is utilized to retrieve the most valuable focal slice.

  • To aggregate the retrieved focal slice and the RGB image, we deploy the two-stream SIM to extract and fuse their deep features.

Abstract

Cutting-edge RGB saliency models are prone to fail in complex scenes, while RGB-D saliency models are often affected by inaccurate depth maps. Fortunately, light field images can provide a sufficient depiction of the spatial layout of 3D scenes. Therefore, this paper focuses on salient object detection in light field images, for which a Similarity Retrieval-based Inference Network (SRI-Net) is proposed. Because of their various focus points, not all focal slices extracted from a light field image are beneficial for salient object detection. The key point of our model is therefore to select the most valuable focal slice, the one that contributes the most complementary information to the RGB image. Specifically, we first design a focal slice retrieval module (FSRM) that chooses an appropriate focal slice by measuring the foreground similarity between each focal slice and the RGB image. Second, to combine the original RGB image and the selected focal slice, we design a U-shaped saliency inference module (SIM), in which a two-stream encoder extracts multi-level features and a decoder aggregates the multi-level deep features. Extensive experiments conducted on two widely used light field datasets firmly demonstrate the superiority and effectiveness of the proposed SRI-Net.

Introduction

Salient object detection (SOD) attempts to imitate human visual attention and highlight the most attractive objects in a scene. It can be treated as a pre-processing step for many computer vision tasks, such as visual tracking [1], image retrieval [2], image/video segmentation [3], image/video compression [4], [5], and so on.

Generally, according to their input type, salient object detection models can be divided into 2D (RGB), 3D (RGB-D), and 4D (light field) models. In the early stage, many traditional models were developed based on heuristic priors [6], [7] and traditional machine learning [8], [9]. In recent years, convolutional neural networks (CNNs) [10] have been widely used in computer vision applications; in particular, fully convolutional networks (FCNs) [11] have largely promoted salient object detection performance [12], [13], [14], [15]. However, 2D and 3D models still have notable shortcomings: 2D models cannot accurately segment salient objects in images with low contrast and complex backgrounds, and 3D models often fail when confronted with low-quality depth maps of RGB-D scenes. Fortunately, light field imaging can effectively capture the spatial layout information of a scene, where the focal slices, one type of light field data, can be obtained by adjusting the focal length of the camera. As shown in Fig. 1, light field data consists of a sequence of focal slices focused at different depths of the scene.

Thus, light field salient object detection has attracted more and more attention and has been pushed forward significantly in recent years. For example, Zhang et al. [16] imitated the human memory mechanism to fuse information and excavate spatial correlations among focal slices. Piao et al. [17] designed an asymmetrical two-stream architecture to exploit the focal slices and produce focal knowledge tailored for the student network. Zhang et al. [18] fused the features of the focal stack using 3D convolution [19] and aggregated the RGB feature with the fused focal stack feature through a synergistic attention module. Most light field saliency models thus try to aggregate all focal slices, which largely promotes detection performance. However, aggregating all focal slices inevitably introduces redundant information [17], [18]. In particular, since each focal slice focuses on a different depth of the scene, salient objects in some focal slices may be out of focus, so the stacked focal slices may introduce redundancy that distracts attention from salient regions. Meanwhile, we notice that a single focal slice can provide complementary information for the RGB image. Therefore, we attempt to select the most valuable focal slice, the one that shares the RGB image's concern for the foreground regions. In this way, we can filter out disturbances from the other focal slices and concentrate on salient regions by fusing features from the retrieved focal slice and the RGB image.

Motivated by the above, we propose a novel Similarity Retrieval-based Inference Network (SRI-Net) for light field salient object detection, as shown in Fig. 2, which contains a focal slice retrieval module (FSRM) and a saliency inference module (SIM). Specifically, some focal slices carry interfering information that degrades the performance of saliency models, whereas the most valuable focal slice provides sufficient salient cues; we retrieve it from all focal slices by deploying the FSRM, shown in Fig. 3. First, the RGB image $I_R$ is fed into a coarse saliency prediction network, which generates the coarse saliency map $S_o$. The coarse saliency map then guides the focal slice retrieval process. Concretely, the RGB image $I_R$ and all focal slices $\{I_F^i\}_{i=1}^{12}$ are multiplied by the coarse saliency map $S_o$, so that both kinds of images retain only the foreground regions. Subsequently, to measure the similarity between the RGB image and each focal slice, we compute the mean absolute error (MAE) between them. The minimal MAE value identifies the most valuable focal slice $I_F$, which focuses on the foreground regions and thus presents a clear appearance of the salient objects along with a blurry background. In this way, the retrieved focal slice $I_F$ provides prominent contrast information (namely, a clear foreground against a blurry background) that complements the RGB image $I_R$.
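To make the retrieval step concrete, the following minimal NumPy sketch implements the logic described above: mask both the RGB image and every focal slice with the coarse saliency map, score each slice by its MAE against the masked RGB image, and keep the slice with the minimal score. The function and variable names are our own illustrative choices, not the authors' code; only the masking-then-MAE procedure and the 12-slice stack follow the description above.

```python
import numpy as np

def retrieve_focal_slice(rgb, focal_slices, coarse_map):
    """Return the focal slice whose masked foreground best matches the RGB image.

    rgb          : (H, W, 3) float array, the RGB image I_R
    focal_slices : sequence of 12 (H, W, 3) float arrays, the focal stack {I_F^i}
    coarse_map   : (H, W) float array in [0, 1], the coarse saliency map S_o
    """
    mask = coarse_map[..., None]          # broadcast the mask over color channels
    fg_rgb = rgb * mask                   # keep only the (soft) foreground of I_R

    # MAE between the masked RGB image and each masked focal slice;
    # the most valuable slice I_F is the one with the minimal MAE.
    maes = [np.mean(np.abs(fg_rgb - s * mask)) for s in focal_slices]
    return focal_slices[int(np.argmin(maes))]
```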

Secondly, after retrieving the effective focal slice, the RGB image $I_R$ and the retrieved focal slice $I_F$ are fed into the two-stream encoder–decoder network shown in Fig. 2. Concretely, the encoder first extracts multi-scale deep features from the RGB image and the retrieved focal slice and fuses the two types of deep features; the decoder then progressively integrates the fused deep features into the final saliency map.
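As a rough structural sketch of such a two-stream U-shaped design, the PyTorch module below runs one encoder per input, fuses the level-wise features, and decodes them back into a full-resolution saliency map. The backbone depth, channel widths, and simple additive fusion are our own simplifying assumptions made for brevity; the paper's actual encoder and fusion strategy are those detailed in its Section 3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class TwoStreamSIM(nn.Module):
    """Illustrative two-stream U-shaped network: two encoders whose level-wise
    features are fused (here by simple addition) and then decoded into a
    one-channel saliency map. Widths and fusion are assumptions, not the paper's."""

    def __init__(self):
        super().__init__()
        self.enc_rgb = nn.ModuleList(
            [conv_block(3, 32), conv_block(32, 64), conv_block(64, 128)])
        self.enc_foc = nn.ModuleList(
            [conv_block(3, 32), conv_block(32, 64), conv_block(64, 128)])
        self.dec = nn.ModuleList([conv_block(128, 64), conv_block(64, 32)])
        self.head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, rgb, focal):
        x, y, fused = rgb, focal, []
        for enc_r, enc_f in zip(self.enc_rgb, self.enc_foc):
            x, y = enc_r(x), enc_f(y)
            fused.append(x + y)                       # level-wise additive fusion
            x, y = F.max_pool2d(x, 2), F.max_pool2d(y, 2)

        # U-shaped decoding: refine, upsample to the skip's size, then add it.
        d = fused[-1]
        for skip, dec in zip(fused[-2::-1], self.dec):
            d = F.interpolate(dec(d), size=skip.shape[-2:],
                              mode="bilinear", align_corners=False)
            d = d + skip
        return torch.sigmoid(self.head(d))            # final saliency map
```

For a pair of (B, 3, 256, 256) inputs, `TwoStreamSIM()(rgb, focal)` yields a (B, 1, 256, 256) map in [0, 1].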

Overall, our main contributions are summarized as follows:

  • 1.

    We propose a novel light field salient object detection model, namely the Similarity Retrieval-based Inference Network (SRI-Net), which consists of the focal slice retrieval module (FSRM) and the saliency inference module (SIM).

  • 2.

    To sufficiently mine the complementary information in the focal slices, we deploy the FSRM to retrieve the most valuable focal slice by comparing the foreground-region similarity between the RGB image and each focal slice; the selected focal slice exhibits a strong contrast between clear salient regions and a blurred background.

  • 3.

    To aggregate the retrieved focal slice and the RGB image, we deploy the two-stream SIM to extract and fuse the deep features of the two inputs. The generated saliency maps highlight the salient objects completely and preserve clear boundary details.

The remainder of this article is organized as follows. Section 2 briefly reviews related works on salient object detection. Section 3 gives a detailed description of the proposed model. Section 4 reports comprehensive experiments and detailed analyses. Finally, Section 5 concludes the paper.

Related works

In recent years, numerous salient object detection models have been proposed. According to their input types, they can be divided into RGB saliency models [20], RGB-D saliency models [21], light field saliency models [22], and so on. Here, we give a brief introduction to these saliency models.

The proposed method

In this section, the architecture of our proposed Similarity Retrieval-based Inference Network (SRI-Net) is detailed in Section 3.1. The focal slice retrieval module (FSRM) and the saliency inference module (SIM) are introduced in Section 3.2 and Section 3.3, respectively. The loss functions are presented in Section 3.4.

Experiments

In this section, datasets and implementation details are described in Section 4.1. Evaluation metrics are shown in Section 4.2. In Section 4.3, we compare our model with 21 state-of-the-art models quantitatively and qualitatively. Ablation studies are discussed in Section 4.4. We present failure cases in Section 4.5.

Conclusion

In this paper, we propose a novel light field salient object detection model, SRI-Net, which consists of a focal slice retrieval module (FSRM) and a saliency inference module (SIM), to accurately pop out salient objects. Firstly, we design the FSRM to retrieve the most valuable focal slice, which endows the RGB image branch with a strong saliency prior cue. In the FSRM, the similarity measurement helps us retrieve the most effective focal slice. This paves a novel road for the light field salient

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grants 2020YFB1406604; the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grants GK229909299001-009; the National Natural Science Foundation of China under Grants 62271180, 62171002, 61901145, U21B2024, 61931008, 62071415, 61972123, 62001146; the Zhejiang Province Nature Science Foundation of China under Grants LR17F030006, LY19F030022, LZ22F020003; the Hangzhou Dianzi

References (60)

  • Jiang H. et al., Salient object detection: A discriminative regional feature integration approach
  • Tong N. et al., Salient object detection via bootstrap learning
  • LeCun Y. et al., Gradient-based learning applied to document recognition, Proc. IEEE (1998)
  • Long J. et al., Fully convolutional networks for semantic segmentation
  • Qin X. et al., BASNet: Boundary-aware salient object detection
  • Fan D.-P. et al., BBS-Net: RGB-D salient object detection with a bifurcated backbone strategy network
  • Zhou X. et al., Dense attention-guided cascaded network for salient object detection of strip steel surface defects, IEEE Trans. Instrum. Meas. (2021)
  • Zhang M. et al., Memory-oriented decoder for light field salient object detection, Adv. Neural Inf. Process. Syst. (2019)
  • Piao Y. et al., Exploit and replace: An asymmetrical two-stream architecture for versatile light field saliency detection
  • Zhang Y. et al., Learning synergistic attention for light field salient object detection
  • Chen Q. et al., RGB-D salient object detection via 3D convolutional neural networks
  • Borji A. et al., Salient object detection: A survey, Comput. Vis. Media (2019)
  • Zhou T. et al., RGB-D salient object detection: A survey, Comput. Vis. Media (2021)
  • Fu K. et al., Light field salient object detection: A review and benchmark, Comput. Vis. Media (2022)
  • Achanta R. et al., Frequency-tuned salient region detection
  • Liu T. et al., Learning to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
  • Wu Z. et al., Cascaded partial decoder for fast and accurate salient object detection
  • Zhou X. et al., Edge-aware multiscale feature integration network for salient object detection in optical remote sensing images, IEEE Trans. Geosci. Remote Sens. (2021)
  • Zhao J.-X. et al., EGNet: Edge guidance network for salient object detection
  • Chen S. et al., Reverse attention-based residual network for salient object detection, IEEE Trans. Image Process. (2020)
This paper has been recommended for acceptance by Zicheng Liu.