Deep spatio-frequency saliency detection
Introduction
Saliency detection aims to identify the most visually distinctive objects in an image. It commonly serves as a pre-processing step in many computer vision tasks, such as semantic segmentation [1], [2], visual tracking [3], [4], image retrieval [5], video segmentation [6], [7], and image captioning [8]. Early saliency detection methods rely on hand-crafted visual features (e.g., color, texture, intensity contrast) [9], [10], [11] or heuristic priors [12], [13], [14], and provide limited performance due to their lack of high-level semantic knowledge.
In recent years, the widely successful Convolutional Neural Networks (CNNs) have shown their power in saliency detection as in other vision tasks [15], [16]. CNN models [17], [18], [19], [20], [21] extract features hierarchically and incorporate high-level semantic information to better infer salient objects. However, we observe two limitations of CNN models for saliency detection. First, they suffer from limited receptive fields in early layers and thus cannot capture sufficient discriminative context until very late layers, making it difficult to quickly capture the holistic information of an image. Second, the resolution of high-level features is usually too low to generate saliency detection results with sharp details and boundaries.
Recently, some researchers have enlarged the receptive fields of CNN-based saliency detection models by increasing network depth [18], [22], but the low resolution of feature maps in deeper layers leads to undesirably coarse saliency predictions and even artifacts. To mitigate this issue, some works [23], [24] adopt dilated convolutional layers to rapidly expand the receptive fields and incorporate larger context without losing resolution. However, dilated convolutions sample input features sparsely in checkerboard patterns, which causes gridding artifacts and requires extra effort to design suitable dilation patterns.
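The gridding effect can be seen by tracing which input positions one output unit of a stack of dilated kernels actually samples. The 1-D sketch below (our own illustration, not from the paper) composes the tap offsets of 3-tap kernels with dilation 2:

```python
# Trace the input offsets visible to one output unit after stacking
# 1-D convolutions with kernel size 3 and dilation 2.
offsets = {-2, 0, 2}  # tap offsets of a 3-tap kernel with dilation 2

visible = {0}
for _ in range(3):  # stack three such layers
    visible = {p + o for p in visible for o in offsets}

print(sorted(visible))  # → [-6, -4, -2, 0, 2, 4, 6]
# Every sampled offset is even: odd input positions are never touched,
# which is the checkerboard/gridding pattern described above.
```

The receptive field grows, but only over positions of one parity; mixing dilation rates (or, as proposed here, moving to the frequency domain) is needed to cover the gaps.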
In this paper, we propose a novel CNN-based saliency detection model that can quickly capture holistic information at early layers while keeping rich details within the high-level features. Specifically, we propose to enlarge the receptive field by transforming spatial CNN features into the frequency domain, and devise a Spatio-Frequency Network (SFNet) for finer saliency detection by leveraging frequency- and spatial-domain features jointly. In the frequency domain, the discrete wavelet transform (DWT) [25], [26] efficiently transforms the whole image into frequency bands and provides a holistic representation of the image; introducing DWT into a convolutional network thus enlarges the receptive field for saliency detection while retaining rich high-frequency details. Therefore, our network adopts DWT to capture comprehensive spatial and frequency information from CNN features. Fig. 1 illustrates the overall flowchart of the proposed network. SFNet first extracts spatial features of different dimensions via CNN layers. Then, a Frequency Residual Module (FRM) is embedded into the shallower CNN layers to effectively augment spatial features with frequency features. In particular, the FRM applies DWT to transform the input spatial features into frequency features at different bands and retains the high-frequency bands. The inverse DWT (iDWT) is then performed on a proxy low-frequency feature, learned from the top CNN layer, together with the retained high-frequency features to generate augmented spatial features. As such, the FRM effectively enlarges the receptive field by incorporating the whole context without introducing extra parameters, and helps refine spatial saliency features in each CNN layer.
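To make the FRM idea concrete, the sketch below implements a one-level 2-D Haar DWT and its inverse in NumPy (a simplified stand-in for whichever wavelet the paper uses) and shows the module's core operation: keep the high-frequency bands and reconstruct with a substituted low-frequency map, where a zero map stands in for the learned proxy feature:

```python
import numpy as np

def haar_dwt2(x):
    # One-level 2-D Haar DWT: rows first, then columns.
    # Returns the low band (LL) and three high bands (LH, HL, HH).
    a, b = x[0::2, :], x[1::2, :]
    lo, hi = (a + b) / 2.0, (a - b) / 2.0
    ll, lh = (lo[:, 0::2] + lo[:, 1::2]) / 2.0, (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    hl, hh = (hi[:, 0::2] + hi[:, 1::2]) / 2.0, (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    # Exact inverse of haar_dwt2.
    lo = np.empty((ll.shape[0], ll.shape[1] * 2))
    hi = np.empty_like(lo)
    lo[:, 0::2], lo[:, 1::2] = ll + lh, ll - lh
    hi[:, 0::2], hi[:, 1::2] = hl + hh, hl - hh
    x = np.empty((lo.shape[0] * 2, lo.shape[1]))
    x[0::2, :], x[1::2, :] = lo + hi, lo - hi
    return x

feat = np.random.rand(8, 8)            # a shallow-layer spatial feature map
ll, lh, hl, hh = haar_dwt2(feat)
proxy_ll = np.zeros_like(ll)           # stand-in for the learned low-frequency proxy
augmented = haar_idwt2(proxy_ll, lh, hl, hh)
```

With the true LL band passed back in, `haar_idwt2` exactly inverts `haar_dwt2`; substituting a proxy LL band, as the FRM does, injects global low-frequency context while the retained bands preserve the original high-frequency detail.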
Following the FRM, SFNet adopts an Aggregation of Frequency and Spatial Feature (AFSF) block for each layer to jointly integrate the frequency-augmented features and the semantically meaningful spatial CNN features, guided by saliency results in a top-down manner. In this way, the aggregated features of each layer contain rich holistic context, and the proposed network can recover more complete salient object parts and details by progressively integrating intermediate saliency predictions. Our main contributions can be summarized as follows:
- We propose a novel spatio-frequency CNN network that detects salient objects by mining cues in both the spatial and frequency domains simultaneously, which is new to deep saliency detection.
- We design a Frequency Residual Module and an Aggregation of Frequency and Spatial Feature module to capture the holistic information of the whole image with a large receptive field.
- The proposed model is shown to effectively capture whole salient objects and provide sharp saliency detection results by exploiting both frequency and spatial features.
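As a rough illustration of the top-down guidance in the AFSF block (our own simplified sketch; the function names and the multiplicative gating choice are ours, not the paper's), a coarse saliency prediction from a deeper layer can be upsampled and used to weight the concatenation of frequency-augmented and plain spatial features:

```python
import numpy as np

def upsample2x(m):
    # Nearest-neighbour 2x upsampling of a 2-D saliency map.
    return np.repeat(np.repeat(m, 2, axis=0), 2, axis=1)

def afsf_sketch(freq_feat, spat_feat, coarse_saliency):
    """Toy stand-in for the AFSF block: concatenate the two feature
    stacks along the channel axis and gate them with the upsampled
    coarse saliency prediction from the layer above."""
    gate = upsample2x(coarse_saliency)[None, :, :]  # broadcast over channels
    return np.concatenate([freq_feat, spat_feat], axis=0) * gate

fused = afsf_sketch(np.random.rand(4, 8, 8),   # frequency-augmented features
                    np.random.rand(4, 8, 8),   # spatial CNN features
                    np.random.rand(4, 4))      # coarse saliency map
print(fused.shape)  # → (8, 8, 8)
```

Repeating this per layer, each aggregation inherits holistic context from above while the finer layers contribute spatial detail.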
Deep saliency detection models
Saliency detection methods can be divided into hand-crafted-feature-based methods [26], [27], [14], [12] and CNN-driven methods [28], [20], [24], [18], [29], [30], [31], [19], [32]. Earlier works [17], [33], [34], [35] utilize hand-crafted features, such as color, texture, and contrast relations, to detect salient objects in an image. Though effective in simple image scenarios, they are not always robust in challenging cases due to the lack of high-level semantic information. Nowadays CNN-based models
Proposed method
In this section, we first revisit the Discrete Wavelet Transform (DWT) as preliminaries in Section 3.1. Then, we describe our key novelties, the Frequency Residual Module (FRM) in Section 3.2 and the Aggregation of Frequency and Spatial Feature (AFSF) module in Section 3.3. Finally, the overall architecture of the proposed model is presented in Section 3.4.
Experiments
In this section, we first present the experiment setups, including the used datasets, the evaluation metrics and the implementation details. Then, we report performance of the proposed method and compare it with existing state-of-the-art methods. Afterward, we conduct comprehensive ablation studies to illustrate the impact of each component in our approach on the performance. At last, for illustrating the computational efficiency of SFNet, we also analyze the computational time of the proposed
Conclusion
In this paper, we propose a novel saliency detection model by learning frequency and spatial domain features jointly. The model first embeds a frequency residual module (FRM) into CNN layers with residual learning to effectively augment spatial features facilitated with frequency features. Such FRM enlarges receptive field size to incorporate whole context without introducing extra parameters, and effectively helps to refine spatial saliency features in each CNN layer. Then, an aggregation
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the Fundamental Research Funds for the Central Universities (No. 2020JBM403), the Beijing Natural Science Foundation (Grants No. 4202057, No. 4202058, No. 4202060), the National Natural Science Foundation of China (No. 62072027, No. 61872032), the Ministry of Education - China Mobile Communications Corporation Foundation (No. MCM20170201), and the program of the China Scholarship Council (No. 201807090094).
Zun Li received the BS degree from the School of Software, Zhengzhou University, in 2014. Currently, she is working toward the PhD degree in the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. She was a visiting scholar in the Department of Electrical and Computer Engineering, National University of Singapore, from 2018 to 2019. Her research interests include computer vision and machine learning.
References (69)
- et al., Saliency motivated improved simplified PCNN model for object segmentation, Neurocomputing, 2018.
- et al., Saliency-based multi-feature modeling for semantic image retrieval, J. Visual Commun. Image Representation, 2018.
- et al., Salient object detection: A discriminative regional feature integration approach, IJCV, 2017.
- et al., STC: A simple to complex framework for weakly-supervised semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
- S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discriminative saliency map with convolutional neural...
- L. Qi, Y. Xu, X. Shang, J. Dong, Fusing visual saliency for material recognition, in: CVPRW, 2018, pp....
- et al., Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 2018.
- X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, F. Porikli, See more, know more: Unsupervised video object segmentation with...
- V. Ramanishka, A. Das, J. Zhang, K. Saenko, Top-down visual saliency guided by captions, in: CVPR, 2017, pp....
- C. Yang, L. Zhang, H. Lu, R. Xiang, M.H. Yang, Saliency detection via graph-based manifold ranking, in: CVPR, 2013, pp....
- Saliency detection via affinity graph learning and weighted manifold ranking, Neurocomputing.
- Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell.
- Saliency detection based on foreground appearance and background-prior, Neurocomputing.
- Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process.
- Revisiting video saliency prediction in the deep learning era, IEEE Trans. Pattern Anal. Mach. Intell.
- Detect globally, refine locally: A novel approach to saliency detection, in: CVPR.
- A saliency detection model using low-level features based on wavelet transform, IEEE Trans. Multimedia.
- Deep visual attention prediction, IEEE Trans. Image Process.
- Correspondence driven saliency transfer, IEEE Trans. Image Process.
- Inferring salient objects from human fixations, IEEE Trans. Pattern Anal. Mach. Intell.
- Video salient object detection via fully convolutional networks, IEEE Trans. Image Process.
Congyan Lang received the Ph.D. degree from the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China, in 2006. She was a Visiting Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, from 2010 to 2011. From 2014 to 2015, she visited the Department of Computer Science, University of Rochester, Rochester, NY, USA, as a Visiting Researcher. She is currently a Professor with the School of Computer and Information Technology, Beijing Jiaotong University. Her current research interests include multimedia information retrieval and analysis, machine learning, and computer vision.
Tao Wang received the Ph.D. degree in the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, P.R. China, in 2013. He is currently an Associate Professor in the School of Computer and Information Technology, Beijing Jiaotong University. He has been a visiting scholar in the Department of Computer & Information Sciences, Temple University, USA, from 2014 to 2015. His research interests include computer vision and machine learning.
Yidong Li is the Vice-Dean and a professor in the School of Computer and Information Technology at Beijing Jiaotong University. Dr. Li received his B.Eng. degree in electrical and electronic engineering from Beijing Jiaotong University in 2003, and M.Sci. and Ph.D. degrees in computer science from the University of Adelaide, in 2006 and 2010, respectively. Dr. Li’s research interests include big data analysis, privacy preserving and information security, data mining, social computing and intelligent transportation. Dr. Li has published more than 80 research papers in various journals and refereed conferences. He has also co-authored/co-edited 5 books (including proceedings) and contributed several book chapters.
Jiashi Feng received the bachelor’s degree in automation from the University of Science and Technology of China, Hefei, China, and the Ph.D. degree from the National University of Singapore, Singapore, in 2014. He was a Post-Doctoral Researcher with the University of California at Berkeley, Berkeley, CA, USA. He is currently an Assistant Professor with the National University of Singapore. His research interests include computer vision and machine learning, particularly, image recognition, attributes learning, robust optimization, and online and distributed learning.