Bagging-based saliency distribution learning for visual saliency detection
Introduction
Saliency detection remains an open problem in computer vision and image processing. It aims to locate the most interesting regions in an image, and it thus supports subsequent computer vision and image processing tasks such as image retrieval [1], action recognition [2], image segmentation [3], video saliency detection [4], [5], [6], [7], and so on. In summary, state-of-the-art saliency detection methods follow two strategies: top-down [8], [9], [10], [11], [12], [13] and bottom-up [14], [15], [16], [17], [18], [19], [20], [21].
Top-down methods are usually driven by specific tasks and rely on a supervised learning framework. They aim to learn a saliency model from numerous training images with ground truth. Deep learning based methods are the most popular top-down methods and have achieved promising performance in recent years: owing to their hierarchical architecture, deep neural networks can effectively exploit high-level semantic information from training images. In contrast, bottom-up methods are faster and simpler than top-down ones because no training images are needed. They mainly exploit low-level features, such as color, texture, and gradient, together with prior knowledge, such as the background, center, and contrast priors. Furthermore, machine learning algorithms are widely applied in bottom-up methods (MLBU), such as bootstrap learning [22], multiple instance learning [23], and the Bayesian framework [24]. The flow of these MLBU methods can be summarized as follows: given an input image, they first use prior knowledge to select some regions of the image as training samples; then, based on various machine learning algorithms, the selected samples are used to train a saliency model that classifies each region of the input image as foreground or background.
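The sample-selection step of this MLBU flow can be sketched as follows. This is a minimal illustration, not the paper's exact rule: the per-region prior values and the two threshold factors are hypothetical, and a simple mean-based adaptive threshold stands in for whatever criterion a given method uses.

```python
import numpy as np

def select_training_samples(prior, hi_factor=1.5, lo_factor=0.5):
    """Pseudo-label regions from a prior map with adaptive thresholds.

    `prior` holds one prior-saliency value per region (superpixel).
    Regions well above the mean prior become foreground samples and
    regions well below it become background samples; the rest stay
    unlabeled. The factor values are illustrative, not from the paper.
    """
    prior = np.asarray(prior, dtype=float)
    mean = prior.mean()
    fg = np.where(prior > hi_factor * mean)[0]  # confident foreground regions
    bg = np.where(prior < lo_factor * mean)[0]  # confident background regions
    return fg, bg

# Toy prior values for six regions; mean = 0.55, so the adaptive
# thresholds are 0.825 (foreground) and 0.275 (background).
fg, bg = select_training_samples(np.array([0.9, 0.8, 0.1, 0.05, 0.5, 0.95]))
```

The unlabeled middle band is deliberate: only confident regions should supervise the classifier, since the prior map itself is noisy.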
Nevertheless, the above methods struggle when the image content is complex: for such an image, it is hard to train a unified saliency model that classifies each region as foreground or background, because these methods ignore the different characteristics of the various regions. In Fig. 1(a), for example, salient region A differs greatly in its features from salient region B but resembles background region C. It is therefore difficult to train a single unified saliency model for Fig. 1(a) that separates the salient regions from the background, because the features of the regions vary so widely. The same situation occurs in Fig. 1(b). Thus, for complex images whose regions exhibit rich and diverse features, training an effective saliency model that separates foreground from background is a challenging but important issue for MLBU methods.
To deal with these problems, we propose a novel visual saliency detection framework based on bagging-based saliency distribution learning (BSDL). In our method, the input image is first segmented into superpixels as basic units (each superpixel represents a region), and each superpixel is represented by deep features extracted from the pre-trained VGG19 net [25]. Two well-known priors, the background prior and the center prior, are then integrated to generate an initial prior map, from which superpixels are selected as training samples by setting an adaptive threshold. Next, the training samples are used to train the BSDL model, which consists of two stages: (1) To improve the generalization ability of the saliency model, we use bagging-based sampling to train multiple saliency classifiers for the input image, i.e., for each classifier we randomly select a subset of all training samples as its training set, so that each classifier corresponds to one training set. (2) We further propose a saliency distribution learning method to infer how reliably each saliency classifier predicts each superpixel's saliency value. That is, for a given superpixel, BSDL not only trains classifiers to predict its saliency value but also learns its saliency distribution, which indicates the reliability of each classifier for that superpixel. Each superpixel's final saliency value is thus determined by both its predicted saliency values and its saliency distribution. Compared with previous works, BSDL first constructs saliency classifiers for the input image and then learns to find the most appropriate classifiers for each superpixel, which is clearly more effective when the input image contains superpixels with diverse features, as in Fig. 1.
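The bagging stage described above can be sketched as follows. This is only an illustration of the idea under simplifying assumptions: a nearest-centroid rule stands in for the paper's base classifiers, and a distance-margin proxy stands in for the learned saliency distribution that weights each classifier per superpixel.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_centroid_classifier(X, y):
    """Toy base classifier: nearest class centroid (stands in for an SVM)."""
    return X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)

def predict(clf, X):
    c_fg, c_bg = clf
    d_fg = np.linalg.norm(X - c_fg, axis=1)
    d_bg = np.linalg.norm(X - c_bg, axis=1)
    return (d_fg < d_bg).astype(float)  # 1 = foreground, 0 = background

def bagging_saliency(X_train, y_train, X_all, n_classifiers=10, ratio=0.6):
    """Bagging-style ensemble with per-superpixel classifier weights.

    Each classifier is trained on a random subset of the pseudo-labeled
    samples. The reliability weight used here (distance margin between
    the two centroids) is an illustrative proxy, not the paper's learned
    saliency distribution.
    """
    n = len(X_train)
    preds, weights = [], []
    for _ in range(n_classifiers):
        idx = rng.choice(n, size=max(2, int(ratio * n)), replace=True)
        if len(set(y_train[idx])) < 2:      # subset must contain both classes
            continue
        clf = train_centroid_classifier(X_train[idx], y_train[idx])
        c_fg, c_bg = clf
        margin = np.abs(np.linalg.norm(X_all - c_bg, axis=1)
                        - np.linalg.norm(X_all - c_fg, axis=1))
        preds.append(predict(clf, X_all))
        weights.append(margin)
    preds, weights = np.array(preds), np.array(weights)
    weights /= weights.sum(axis=0, keepdims=True)  # normalize per superpixel
    return (preds * weights).sum(axis=0)           # weighted saliency in [0, 1]
```

Because the weights are normalized per superpixel, each superpixel effectively picks the classifiers that are most decisive for it, which is the intuition behind letting a saliency distribution arbitrate among ensemble members.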
Because BSDL treats each superpixel as an individual instance without exploring the spatial relationships between superpixels, a saliency optimization method is then applied to further improve the quality of the saliency map obtained by BSDL. Previous optimization methods usually assign similar saliency values to adjacent superpixels with similar features, which makes it hard to enforce saliency consistency between foreground superpixels when the salient object consists of multiple regions with different features. Unlike these works, we propose a foreground consistency saliency optimization framework (FCSO) to further refine the saliency map obtained by BSDL. Two novel optimization matrices, a local structure matrix and a spatial compactness matrix, are proposed to exploit saliency cues from a local structural perspective and a global spatial perspective, respectively. The proposed FCSO enforces saliency consistency between foreground superpixels better than previous works. To improve computational efficiency, a prejudgment mechanism is also proposed to evaluate the quality of the saliency map obtained by BSDL and to decide whether the FCSO is needed for the input image. In summary, the contributions of the proposed method are as follows:
- (1)
The first contribution is the development of the bagging-based saliency distribution learning model (BSDL). Given an input image, classifiers are first trained with the bagging-based method to predict each superpixel's saliency value. For each superpixel, we also learn its saliency distribution, which infers the reliability of each classifier for predicting that superpixel's saliency value. Each superpixel's saliency value is determined by both its predicted saliency values and its saliency distribution. Because BSDL analyzes the different characteristics of the various superpixels in the input image in depth, it is clearly more effective than previous works on complex images.
- (2)
The second contribution is a foreground consistency saliency optimization framework (FCSO) that further improves the quality of the saliency map obtained by BSDL. In the FCSO, a new local structure matrix and a new spatial compactness matrix are developed to update all superpixels' saliency values.
- (3)
The third contribution is the development of an effective prejudgment mechanism, which evaluates the quality of the saliency map obtained by BSDL and helps decide whether the FCSO is needed for the input image.
Section snippets
Related work
Deep learning based methods have achieved outstanding performance in recent years. Wang et al. [26] construct two CNN frameworks, a global search network and a local estimation network, to exploit saliency cues. He et al. [27] learn a CNN framework named Super-CNN to construct superpixel-level saliency maps. Hou et al. [8] introduce short connections into the skip-layer structure within a hierarchical architecture. In [9], multi-scale deep features are learned from CNN to
Bagging-based saliency distribution learning (BSDL)
The given image is segmented into superpixels as basic units in our method. We first integrate two priors to obtain an initial prior map, which guides the subsequent selection of training samples. We then use the selected training samples to train the BSDL model, which consists of two stages: (1) Based on bagging-based sampling, we train classifiers from the training samples, each of which predicts whether a superpixel is foreground or background (1/0). (2)
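The prior-fusion step mentioned above can be sketched as follows. The Gaussian center prior and the multiplicative fusion used here are common choices in the saliency literature, assumed for illustration; the paper's exact fusion rule and sigma are not shown in this snippet.

```python
import numpy as np

def initial_prior_map(bg_prior, centers, sigma=0.35):
    """Fuse a background prior with a center prior (illustrative sketch).

    `bg_prior`: per-superpixel background-prior saliency in [0, 1].
    `centers`:  (n, 2) normalized (x, y) centroids of the superpixels,
                with the image center at (0.5, 0.5).
    """
    centers = np.asarray(centers, dtype=float)
    d2 = ((centers - 0.5) ** 2).sum(axis=1)        # squared distance to image center
    center_prior = np.exp(-d2 / (2 * sigma ** 2))  # Gaussian falloff from the center
    prior = np.asarray(bg_prior, dtype=float) * center_prior
    return prior / (prior.max() + 1e-12)           # normalize to [0, 1]

# Two superpixels with equal background-prior scores: one at the image
# center, one at a corner. The center prior suppresses the corner one.
p = initial_prior_map(np.array([1.0, 1.0]), [[0.5, 0.5], [0.0, 0.0]])
```

Multiplicative fusion is conservative: a superpixel must score well under both priors to obtain a high initial value, which keeps the subsequent pseudo-labels clean.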
Foreground consistency saliency optimization (FCSO)
Because BSDL treats each superpixel as an individual instance without exploiting the spatial relationships between superpixels, we propose a foreground consistency saliency optimization framework (FCSO) to further refine the saliency result obtained by BSDL. Previous optimization methods usually assign similar saliency values to adjacent superpixels with similar features; however, they struggle to enforce saliency consistency between foreground
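The kind of graph-based refinement this section builds on can be sketched as follows. This is not the FCSO itself: the closed form below is the standard manifold-ranking solution, and the simple affinity matrix in the test is a placeholder for the paper's local structure and spatial compactness matrices.

```python
import numpy as np

def refine_saliency(W, s_init, alpha=0.99):
    """Propagate saliency over a superpixel graph (generic sketch).

    W is a nonnegative affinity matrix between superpixels; in the FCSO
    its role is played by the local structure and spatial compactness
    matrices. Solving (I - alpha * S) s = s_init with the symmetrically
    normalized affinity S smooths saliency over the graph, so strongly
    connected superpixels end up with consistent values.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    Dn = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = Dn @ W @ Dn                                  # normalized affinity
    s = np.linalg.solve(np.eye(len(W)) - alpha * S,
                        np.asarray(s_init, dtype=float))
    return s / (s.max() + 1e-12)                     # rescale to [0, 1]
```

In a toy graph where superpixels 0 and 1 are strongly connected and superpixel 2 is weakly connected to both, seeding only superpixel 0 pulls superpixel 1 up with it while superpixel 2 stays low, which is exactly the consistency behavior an optimization stage is meant to add.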
Prejudgment mechanism
In some cases, the FCSO fails to improve the quality of the saliency map obtained by BSDL; for some images it even produces a worse saliency map than BSDL, as in Fig. 6. To address this problem, we construct a prejudgment mechanism that evaluates the quality of the saliency map obtained by BSDL and determines whether the FCSO is needed.
Generally, in a good saliency map there is a strong contrast between the salient object and the background, i.e., the saliency values of most
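One simple way to realize such a contrast-based quality test is sketched below. The thresholds and the mean-difference score are illustrative assumptions; the paper's exact measure is not shown in this snippet.

```python
import numpy as np

def saliency_map_contrast(sal, fg_thresh=0.6, bg_thresh=0.4):
    """Contrast score for a saliency map (illustrative).

    A good map separates foreground and background sharply, so the mean
    saliency of confidently salient pixels should sit far above that of
    confidently non-salient pixels. Both thresholds are assumptions.
    """
    sal = np.asarray(sal, dtype=float)
    fg = sal[sal >= fg_thresh]
    bg = sal[sal <= bg_thresh]
    if fg.size == 0 or bg.size == 0:
        return 0.0                 # degenerate map: separation not measurable
    return float(fg.mean() - bg.mean())

def needs_fcso(sal, contrast_thresh=0.7):
    """Trigger the refinement stage only when the map lacks contrast."""
    return saliency_map_contrast(sal) < contrast_thresh
```

A map like `[0.95, 0.9, 0.05, 0.1]` has contrast 0.85 and would skip refinement, while a washed-out map like `[0.65, 0.6, 0.35, 0.4]` (contrast 0.25) would be sent to the FCSO.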
Experiments
We compare the proposed method with 15 state-of-the-art methods: LEGs [26], S-CNN [27], BSCA [31], TLLT [38], LPS [18], MAP [39], MST [40], KSR [41], LDS [42], SMD [43], MILP [23], DGLS [33], HCA [28], AE [17] and FCB [44]. Among them, LEGs, S-CNN, KSR, AE and HCA exploit saliency cues with deep neural networks (DNNs); MST, LDS, MILP, DGLS and SMD are saliency detection methods based on classical machine learning algorithms or mathematical theories; BSCA, TLLT, LPS and MAP are
Conclusion
In this paper, we propose a novel saliency detection framework based on bagging-based saliency distribution learning (BSDL). The input image is segmented into superpixels as basic units. First, we construct an initial prior map that roughly extracts saliency cues by integrating two priors. The initial prior map is used to select superpixels from the input image as training samples, which are then used to train the BSDL model: (1) We use bagging-based sampling to train SVM
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant nos. 61701101, 61973093, U1713216, 61901098, 61971118, the Fundamental Research Fund for the Central Universities of China N2026005, N181602014, N2026004, N2026006, N2026001, N2011001, and the project for the science and technology major special plan of Liaoning 2019JH1/10100005.
References (47)
- et al., Improved image deblurring based on salient-region segmentation, Signal Process. Image Commun. (2013)
- et al., Depth-aware saliency detection using convolutional neural networks, J. Vis. Commun. Image Represent. (2019)
- et al., Salient object detection using background subtraction, Gabor filters, objectness and minimum directional backgroundness, J. Vis. Commun. Image Represent. (2019)
- et al., Saliency detection integrating global and local information, J. Vis. Commun. Image Represent. (2018)
- et al., Saliency detection via local structure propagation, J. Vis. Commun. Image Represent. (2018)
- et al., Integrating visual saliency and consistency for re-ranking image search results, IEEE Trans. Multimedia (2011)
- et al., A robust and efficient video representation for action recognition, Int. J. Comput. Vis. (2016)
- et al., Improved robust video saliency detection based on long-term spatial–temporal information, IEEE Trans. Image Process. (2020)
- et al., Accurate and robust video saliency detection via self-paced diffusion, IEEE Trans. Multimedia (2020)
- et al., Bi-level feature learning for video saliency detection, IEEE Trans. Multimedia (2018)