Deep spatio-frequency saliency detection
Introduction
Saliency detection aims to identify the most visually distinctive objects in an image. It commonly serves as a pre-processing step in many computer vision tasks, such as semantic segmentation [1], [2], visual tracking [3], [4], image retrieval [5], video segmentation [6], [7], and image captioning [8]. Early saliency detection methods rely on hand-crafted visual features (e.g., color, texture, intensity contrast) [9], [10], [11] or heuristic priors [12], [13], [14], and provide limited performance due to their lack of high-level semantic knowledge.
In recent years, the widely successful Convolutional Neural Networks (CNNs) have shown their power in saliency detection as in other vision tasks [15], [16]. CNN models [17], [18], [19], [20], [21] extract features hierarchically and incorporate high-level semantic information to better infer salient objects. However, we observe two limitations of CNN models for saliency detection. First, they suffer from limited receptive fields in early layers and thus cannot capture sufficient discriminative context until very late layers, making it difficult to quickly capture the holistic information of an image. Second, the resolution of high-level features is usually too low to generate saliency detection results with sharp details and boundaries.
Recently, some researchers have enlarged the receptive fields of CNN-based saliency detection models by increasing network depth [18], [22], but the low resolution of feature maps in deeper layers leads to undesirably coarse saliency predictions and even artifacts. To mitigate this issue, some works [23], [24] adopt dilated convolutional layers to rapidly expand the receptive fields and incorporate larger context without losing resolution. However, dilated convolutions sample input features sparsely in checkerboard patterns, which causes gridding artifacts and requires extra effort to design suitable dilation patterns.
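The gridding effect can be seen by tracing which input positions one output unit of a stack of dilated kernels actually samples. The 1-D sketch below (our own illustration, not from the paper) composes the tap offsets of 3-tap kernels with dilation 2:

```python
# Trace the input offsets visible to one output unit after stacking
# 1-D convolutions with kernel size 3 and dilation 2.
offsets = {-2, 0, 2}  # tap offsets of a 3-tap kernel with dilation 2

visible = {0}
for _ in range(3):  # stack three such layers
    visible = {p + o for p in visible for o in offsets}

print(sorted(visible))  # → [-6, -4, -2, 0, 2, 4, 6]
# Every sampled offset is even: odd input positions are never touched,
# which is the checkerboard/gridding pattern described above.
```

The receptive field grows, but only over positions of one parity; mixing dilation rates (or, as proposed here, moving to the frequency domain) is needed to cover the gaps.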
In this paper, we propose a novel CNN-based saliency detection model that can quickly capture holistic information at early layers while keeping rich details within the high-level features. Specifically, we propose to enlarge the receptive field by transforming spatial CNN features into the frequency domain, and devise a Spatio-Frequency Network (SFNet) for finer saliency detection by leveraging frequency- and spatial-domain features jointly. In the frequency domain, the discrete wavelet transform (DWT) [25], [26] efficiently transforms the whole image into frequency bands and provides a holistic representation of the image; introducing DWT into a convolutional network thus enlarges the receptive field for saliency detection while retaining rich high-frequency details. Therefore, our network adopts DWT to capture comprehensive spatial and frequency information from CNN features. Fig. 1 illustrates the overall flowchart of the proposed network. SFNet first extracts spatial features of different dimensions via CNN layers. Then, a Frequency Residual Module (FRM) is embedded into the shallower CNN layers to effectively augment spatial features with frequency features. In particular, the FRM applies DWT to transform the input spatial features into frequency features at different bands and retains the high-frequency bands. The inverse DWT (iDWT) is then performed on a proxy low-frequency feature, learned from the top CNN layer, together with the retained high-frequency features to generate augmented spatial features. As such, the FRM effectively enlarges the receptive field by incorporating the whole context without introducing extra parameters, and helps refine spatial saliency features in each CNN layer.
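To make the FRM idea concrete, the sketch below implements a one-level 2-D Haar DWT and its inverse in NumPy (a simplified stand-in for whichever wavelet the paper uses) and shows the module's core operation: keep the high-frequency bands and reconstruct with a substituted low-frequency map, where a zero map stands in for the learned proxy feature:

```python
import numpy as np

def haar_dwt2(x):
    # One-level 2-D Haar DWT: rows first, then columns.
    # Returns the low band (LL) and three high bands (LH, HL, HH).
    a, b = x[0::2, :], x[1::2, :]
    lo, hi = (a + b) / 2.0, (a - b) / 2.0
    ll, lh = (lo[:, 0::2] + lo[:, 1::2]) / 2.0, (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    hl, hh = (hi[:, 0::2] + hi[:, 1::2]) / 2.0, (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    # Exact inverse of haar_dwt2.
    lo = np.empty((ll.shape[0], ll.shape[1] * 2))
    hi = np.empty_like(lo)
    lo[:, 0::2], lo[:, 1::2] = ll + lh, ll - lh
    hi[:, 0::2], hi[:, 1::2] = hl + hh, hl - hh
    x = np.empty((lo.shape[0] * 2, lo.shape[1]))
    x[0::2, :], x[1::2, :] = lo + hi, lo - hi
    return x

feat = np.random.rand(8, 8)            # a shallow-layer spatial feature map
ll, lh, hl, hh = haar_dwt2(feat)
proxy_ll = np.zeros_like(ll)           # stand-in for the learned low-frequency proxy
augmented = haar_idwt2(proxy_ll, lh, hl, hh)
```

With the true LL band passed back in, `haar_idwt2` exactly inverts `haar_dwt2`; substituting a proxy LL band, as the FRM does, injects global low-frequency context while the retained bands preserve the original high-frequency detail.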
Following the FRM, SFNet adopts an Aggregation of Frequency and Spatial Feature (AFSF) block for each layer to jointly integrate the frequency-augmented features and the semantically meaningful spatial CNN features, guided by saliency results in a top-down manner. In this way, the aggregated features of each layer contain rich holistic context, and the proposed network can recover more complete salient object parts and details by progressively integrating intermediate saliency predictions. Our main contributions can be summarized as follows:
- We propose a novel spatio-frequency CNN network that detects salient objects by mining cues in both the spatial and frequency domains simultaneously, which is new to deep saliency detection.
- We design a Frequency Residual Module and an Aggregation of Frequency and Spatial Feature module to capture the holistic information of the whole image with a large receptive field.
- The proposed model is shown to effectively capture whole salient objects and provide sharp saliency detection results by exploiting both frequency and spatial features.
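As a rough illustration of the top-down guidance in the AFSF block (our own simplified sketch; the function names and the multiplicative gating choice are ours, not the paper's), a coarse saliency prediction from a deeper layer can be upsampled and used to weight the concatenation of frequency-augmented and plain spatial features:

```python
import numpy as np

def upsample2x(m):
    # Nearest-neighbour 2x upsampling of a 2-D saliency map.
    return np.repeat(np.repeat(m, 2, axis=0), 2, axis=1)

def afsf_sketch(freq_feat, spat_feat, coarse_saliency):
    """Toy stand-in for the AFSF block: concatenate the two feature
    stacks along the channel axis and gate them with the upsampled
    coarse saliency prediction from the layer above."""
    gate = upsample2x(coarse_saliency)[None, :, :]  # broadcast over channels
    return np.concatenate([freq_feat, spat_feat], axis=0) * gate

fused = afsf_sketch(np.random.rand(4, 8, 8),   # frequency-augmented features
                    np.random.rand(4, 8, 8),   # spatial CNN features
                    np.random.rand(4, 4))      # coarse saliency map
print(fused.shape)  # → (8, 8, 8)
```

Repeating this per layer, each aggregation inherits holistic context from above while the finer layers contribute spatial detail.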
Deep saliency detection models
Saliency detection methods can be divided into hand-crafted-feature-based methods [26], [27], [14], [12] and CNN-driven methods [28], [20], [24], [18], [29], [30], [31], [19], [32]. Earlier works [17], [33], [34], [35] utilize hand-crafted features, such as color, texture, and contrast relations, to detect salient objects in an image. Though effective in simple image scenarios, they are not always robust in challenging cases due to the lack of high-level semantic information. Nowadays CNN-based models
Proposed method
In this section, we first revisit the Discrete Wavelet Transform (DWT) as preliminaries in Section 3.1. Then, we describe our key novelties, the Frequency Residual Module (FRM) in Section 3.2 and the Aggregation of Frequency and Spatial Feature (AFSF) module in Section 3.3. Finally, the overall architecture of the proposed model is presented in Section 3.4.
Experiments
In this section, we first present the experiment setups, including the used datasets, the evaluation metrics and the implementation details. Then, we report performance of the proposed method and compare it with existing state-of-the-art methods. Afterward, we conduct comprehensive ablation studies to illustrate the impact of each component in our approach on the performance. At last, for illustrating the computational efficiency of SFNet, we also analyze the computational time of the proposed
Conclusion
In this paper, we propose a novel saliency detection model by learning frequency and spatial domain features jointly. The model first embeds a frequency residual module (FRM) into CNN layers with residual learning to effectively augment spatial features facilitated with frequency features. Such FRM enlarges receptive field size to incorporate whole context without introducing extra parameters, and effectively helps to refine spatial saliency features in each CNN layer. Then, an aggregation
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the Fundamental Research Funds for the Central Universities (No. 2020JBM403), the Beijing Natural Science Foundation (Grants No. 4202057, No. 4202058, No. 4202060), the National Natural Science Foundation of China (No. 62072027, No. 61872032), the Ministry of Education - China Mobile Communications Corporation Foundation (No. MCM20170201), and the program of the China Scholarship Council (No. 201807090094).
Zun Li received the BS degree from the School of Software, Zhengzhou University, in 2014. Currently, she is working toward the PhD degree in the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. She was a visiting scholar in the Department of Electrical and Computer Engineering, National University of Singapore, from 2018 to 2019. Her research interests include computer vision and machine learning.
References (69)
- et al., Saliency motivated improved simplified PCNN model for object segmentation, Neurocomputing, 2018.
- et al., Saliency-based multi-feature modeling for semantic image retrieval, J. Visual Commun. Image Representation, 2018.
- et al., Salient object detection: A discriminative regional feature integration approach, IJCV, 2017.
- et al., STC: A simple to complex framework for weakly-supervised semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
- S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discriminative saliency map with convolutional neural...
- L. Qi, Y. Xu, X. Shang, J. Dong, Fusing visual saliency for material recognition, in: CVPRW, 2018, pp....
- et al., Saliency-aware video object segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 2018.
- X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, F. Porikli, See more, know more: Unsupervised video object segmentation with...
- V. Ramanishka, A. Das, J. Zhang, K. Saenko, Top-down visual saliency guided by captions, in: CVPR, 2017, pp....
- C. Yang, L. Zhang, H. Lu, R. Xiang, M.H. Yang, Saliency detection via graph-based manifold ranking, in: CVPR, 2013, pp....
- Saliency detection via affinity graph learning and weighted manifold ranking, Neurocomputing.
- Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell.
- Saliency detection based on foreground appearance and background-prior, Neurocomputing.
- Consistent video saliency using local gradient flow optimization and global refinement, IEEE Trans. Image Process.
- Revisiting video saliency prediction in the deep learning era, IEEE Trans. Pattern Anal. Mach. Intell.
- Detect globally, refine locally: A novel approach to saliency detection, in: CVPR.
- A saliency detection model using low-level features based on wavelet transform, IEEE Trans. Multimedia.
- Deep visual attention prediction, IEEE Trans. Image Process.
- Correspondence driven saliency transfer, IEEE Trans. Image Process.
- Inferring salient objects from human fixations, IEEE Trans. Pattern Anal. Mach. Intell.
- Video salient object detection via fully convolutional networks, IEEE Trans. Image Process.
Congyan Lang received the Ph.D. degree from the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China, in 2006. She was a Visiting Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, from 2010 to 2011. From 2014 to 2015, she visited the Department of Computer Science, University of Rochester, Rochester, NY, USA, as a Visiting Researcher. She is currently a Professor with the School of Computer and Information Technology, Beijing Jiaotong University. Her current research interests include multimedia information retrieval and analysis, machine learning, and computer vision.
Tao Wang received the Ph.D. degree in the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, P.R. China, in 2013. He is currently an Associate Professor in the School of Computer and Information Technology, Beijing Jiaotong University. He has been a visiting scholar in the Department of Computer & Information Sciences, Temple University, USA, from 2014 to 2015. His research interests include computer vision and machine learning.
Yidong Li is the Vice-Dean and a professor in the School of Computer and Information Technology at Beijing Jiaotong University. Dr. Li received his B.Eng. degree in electrical and electronic engineering from Beijing Jiaotong University in 2003, and M.Sci. and Ph.D. degrees in computer science from the University of Adelaide, in 2006 and 2010, respectively. Dr. Li’s research interests include big data analysis, privacy preserving and information security, data mining, social computing and intelligent transportation. Dr. Li has published more than 80 research papers in various journals and refereed conferences. He has also co-authored/co-edited 5 books (including proceedings) and contributed several book chapters.
Jiashi Feng received the bachelor’s degree in automation from the University of Science and Technology of China, Hefei, China, and the Ph.D. degree from the National University of Singapore, Singapore, in 2014. He was a Post-Doctoral Researcher with the University of California at Berkeley, Berkeley, CA, USA. He is currently an Assistant Professor with the National University of Singapore. His research interests include computer vision and machine learning, particularly, image recognition, attributes learning, robust optimization, and online and distributed learning.