1 Introduction

In the early 1980s, Treisman and Gelade [23] proposed the feature integration theory of visual attention, in which the visual scene is initially coded along a number of separable dimensions such as color, orientation, spatial frequency, brightness, and direction of movement to provide feature maps. These maps are then recombined to ensure the correct synthesis of features and to provide the final focal attention. Based on this theory, various visual saliency models have been developed. Such models can be grouped into two categories: local and global approaches.

Local approaches measure the rarity of a region with respect to its neighborhood. Itti et al. [10] derived a bottom-up visual saliency model based on center-surround differences over multi-scale image features. A bottom-up saliency model derived from a Bayesian framework was proposed in [26]. A saliency model that computes local descriptors from a given image in order to measure the similarity of a pixel to its neighborhood was proposed in [22]. The AWS method [5] is based on two biological mechanisms: decorrelation and distinctiveness of local responses. Harel et al. [6] proposed GBVS, a bottom-up saliency approach that consists of two steps: generating activation maps over feature channels and normalizing them.

In contrast, global approaches are based on the rarity and uniqueness of image regions with respect to the whole scene. Scharfenberger et al. [21] proposed a texture-based saliency model, where an object is salient if its texture is distinctive from the rest of the scene. Kim et al. [12] developed a method that separates the background from the foreground to highlight the salient object. Cheng et al. [2] proposed a regional-contrast-based salient object detection model, which assumes that human cortical cells preferentially respond to high-contrast stimuli in their receptive fields. Hou and Zhang [7] introduced a spectral method that analyzes the spectrum of the input image in order to extract the residual spectrum. Bruce and Tsotsos [1] proposed the AIM approach, which uses Shannon's self-information measure to maximize the information sampled from a scene. Scharfenberger et al. [20] proposed a salient object detection model that uses the structural and textural characteristics of natural images.

While there are many computational models that detect salient regions in still images, video saliency methods are still at an early stage. In that context, Itti and Baldi [9] assumed that salient objects are related to surprising events and developed a model that detects objects associated with such events. Rahtu et al. [19] developed a saliency model that incorporates local saliency features into a conditional random field. Mancas et al. [16] used the optical flow magnitude to highlight motion in a crowd. A video saliency model based on optical flow strength and static saliency features of the input video frame was proposed by Zhong et al. [27]. Besides motion, Fang et al. [3] used color, luminance, and texture to produce a saliency model in the compressed domain. Lee et al. [13] combine a set of spatial saliency features, including rarity, compactness, and center prior, with the temporal features of motion intensity and motion contrast in an SVM regressor to detect the salient object of each video frame. Kim et al. [11] developed an approach based on random walk with restart to detect salient regions: a temporal saliency distribution is first computed using motion distinctiveness and then used as the restarting distribution of the random walker, while spatial features are used to design the walker's transition probability matrix, from which the final spatiotemporal saliency distribution is estimated.

While the feature integration theory of visual attention has led to the development of many saliency approaches (for videos and still images), the Boolean map theory of visual attention [8] has also attracted researchers. Zhang and Sclaroff proposed the Boolean Map Saliency (BMS) model [25]. Qi et al. [18] also used Boolean maps in a multi-scale propagation method, where graph inference is performed to produce the final saliency maps.

Therefore, in this paper we propose a novel video saliency model based on Boolean maps. First, we compute the optical flow of each pair of frames and use our proposed motion feature to remove the noise caused by camera motion. The smoothed optical flow serves to produce the motion Boolean map. Then, we generate the color Boolean maps by thresholding the color channel maps of the input frame. Thereafter, we combine the color and motion Boolean maps into one global map. Finally, we use the Gestalt principle of figure-ground segregation to keep the surrounded regions in each Boolean map and eliminate the unenclosed ones. The saliency of each video frame is the mean of the processed global Boolean maps of that frame over the total number of randomly generated Boolean maps. The main contribution of this paper is the evaluation of Boolean maps for video saliency. As an additional contribution, we propose a new motion feature for saliency prediction. To evaluate the proposed approach we use two standard benchmark datasets for video saliency: SegTrack v2 [14] and Fukuchi [4].

The paper is organized as follows: first, we present our method and explain how Boolean maps are generated in Sect. 2. Then, we discuss experimental results in Sect. 3. Finally, we provide conclusions in Sect. 4.

2 Boolean-Map Video Saliency

The Boolean map theory of visual attention was introduced by Huang and Pashler [8], who assume that, at any given moment, an observer's awareness of a scene can be represented by a Boolean map. From that assumption, we derive a video saliency model that highlights regions of interest in videos. We first fix two saliency features, motion and color, and then build a color Boolean map and a motion Boolean map for each frame. These Boolean maps define the saliency level of each region according to its connectivity (connected regions belong to the foreground).

2.1 Boolean Maps Generation

The Boolean map saliency method proposed by Zhang and Sclaroff [25] thresholds the feature maps of the input image in the Lab color space to produce Boolean maps. Later, Qi et al. [18] combined the RGB, Lab, and HSV color spaces to generate more precise Boolean maps. Recent saliency detection works based on the Boolean map theory have thus relied on color cues only. In this paper, we present a Boolean map video saliency model that uses both motion and color cues to generate Boolean maps. Recent video saliency detection works [11, 24] have shown that moving objects attract attention. Therefore, optical flow is used to estimate motion and determine the moving object in each video frame. We use the optical flow estimation method proposed in [15] to produce direction and velocity measures. With a static camera, optical flow can be a reliable video saliency indicator, but we work on benchmark datasets that include scenarios where the camera is not static. Since optical flow captures pixels that change position from one frame to another, the motion caused by the camera is also captured. To remove this noise from the optical flow, we propose a new motion feature.

Let \(O_t\) and \(M_t\) denote the orientation and the magnitude of the optical flow at frame t. We define the motion strength as

$$\begin{aligned} S_t(x,y)=\sqrt{M_t(x,y)^2-O_t(x,y)^2} \end{aligned}$$
(1)

Since the optical flow noise caused by camera motion produces erroneous measures, we use the motion feature proposed by Papazoglou and Ferrari [17] to extract the motion boundaries

$$\begin{aligned} \hat{M}_t(x,y)= 1-\exp (-\lambda _M \, M_t(x,y)) \end{aligned}$$
(2)

where \(\lambda _M\) controls the steepness of the function and is set to 0.8. Our motion feature is then computed as

$$\begin{aligned} \tilde{M}_t(x,y)=\frac{\hat{M}_t(x,y)\, S_t(x,y)}{\max \limits _{x,y} \hat{M}_t(x,y)} \end{aligned}$$
(3)

We define \(\tilde{M}_t\) as the computed motion feature at frame t. To determine the motion Boolean map, we apply the function \(thresh(\cdot ,\theta )\), which assigns 1 to a pixel if its value is greater than the threshold \(\theta \) and 0 otherwise (see Eq. 4).

$$\begin{aligned} B_m(t)=thresh(\tilde{M}_t,\theta ) \end{aligned}$$
(4)

In our experiments, \(\theta \) is sampled between the minimum and the maximum of \(\tilde{M}_t\). While motion is essential for predicting saliency in videos, color is a crucial cue for saliency prediction in still images. To strengthen the saliency prediction for our video frames, we therefore also use a static saliency feature (color). We select the RGB and Lab color spaces to produce the color Boolean map.
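As an illustration, a minimal Python sketch of the motion Boolean map generation (Eqs. 1-4) could look as follows. It substitutes OpenCV's Farneback estimator for the optical flow method of [15]; the clipping of negative values under the square root, the uniform sampling of \(\theta \), and all function names are assumptions of this sketch, not part of the original method description.

```python
import cv2
import numpy as np

LAMBDA_M = 0.8  # steepness of Eq. 2, set to 0.8 as in the text

def motion_boolean_map(prev_gray, curr_gray, rng=np.random):
    """Sketch of the motion Boolean map B_m(t) of Eqs. 1-4."""
    # Optical flow between two consecutive grayscale frames
    # (Farneback is a stand-in for the estimator of [15]).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    M, O = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # magnitude, orientation

    # Eq. 1: motion strength (negative values clipped to keep the sqrt real)
    S = np.sqrt(np.maximum(M ** 2 - O ** 2, 0.0))

    # Eq. 2: boundary-like transform of the magnitude [17]
    M_hat = 1.0 - np.exp(-LAMBDA_M * M)

    # Eq. 3: smoothed motion feature
    M_tilde = (M_hat * S) / (M_hat.max() + 1e-8)

    # Eq. 4: threshold sampled between the minimum and maximum of the feature
    theta = rng.uniform(M_tilde.min(), M_tilde.max())
    return (M_tilde > theta).astype(np.uint8)
```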

We define a vector \(F_c=\{[F_R, F_G, F_B], [F_L, F_a, F_b]\}\) with \(c \in [1,6]\). We then generate the feature map as a linear combination of \(f_m(t)\), the feature channels of frame t, and the vector \(F_c\)

$$\begin{aligned} F_m(t)=f_m(t)*F_c, \end{aligned}$$
(5)

The color Boolean map can be computed as

$$\begin{aligned} B_c(t)=thresh(F_m(t),\beta ), \end{aligned}$$
(6)

where the function \(thresh(\cdot ,\beta )\) likewise assigns 1 to a pixel if its value is greater than \(\beta \) and 0 otherwise. The values of the feature map \(F_m\) are assumed to be uniformly distributed between 0 and 255. The threshold \(\beta \) is sampled between the minimum and the maximum values of the feature map \(F_m\). Given a color Boolean map and a motion Boolean map, the master Boolean map is estimated as the union of both maps

$$\begin{aligned} B(t)= B_m(t)\cup B_c(t) \end{aligned}$$
(7)
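A matching sketch for the color Boolean maps and the master map (Eqs. 5-7) is given below. Eq. 5 only states that the feature map is a linear combination of the color channels, so the random channel weights used here, like the function names, are assumptions made for illustration.

```python
def color_boolean_map(frame_bgr, rng=np.random):
    """Sketch of the color Boolean map B_c(t) of Eqs. 5-6."""
    # Stack the six channels F_c = {R, G, B, L, a, b}.
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float32)
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    channels = np.concatenate([rgb, lab], axis=2)        # H x W x 6

    # Eq. 5: feature map as a linear combination of the channels
    weights = rng.uniform(0.0, 1.0, size=6)
    F = channels @ weights                               # H x W

    # Eq. 6: threshold beta sampled between the extrema of the feature map
    beta = rng.uniform(F.min(), F.max())
    return (F > beta).astype(np.uint8)

def master_boolean_map(B_m, B_c):
    """Eq. 7: union (pixel-wise OR) of the motion and color Boolean maps."""
    return np.maximum(B_m, B_c)
```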

2.2 Saliency Computation

The Gestalt principle of figure-ground segregation states that connected regions are more likely to be perceived as figures, i.e., to belong to the foreground. A Boolean map decomposes the input frame into selected and non-selected regions, where a selected region is a connected region whose value is either 0 or 1. Based on this principle, the saliency maps are computed as follows. First, connected regions that touch the frame border are eliminated and assigned to the background. Then, each pixel of the Boolean map is marked 1 if it belongs to a surrounded region, which means that it is salient, and 0 otherwise. For each Boolean map of each video frame, a post-processed map R that highlights only the important connected regions is thus derived

$$\begin{aligned} R(x,y) = \left\{ \begin{array}{ll} 1 &{} (x,y) \in SR \\ 0 &{} \text{ otherwise } \end{array} \right. \end{aligned}$$
(8)

where SR is a surrounded region. The map R is then smoothed so that small areas receive more emphasis: we apply a dilation to each post-processed Boolean map R, followed by a linear normalization.

The final saliency map is defined as the mean of the post-processed Boolean maps R over the total number of generated Boolean maps

$$\begin{aligned} S(x,y)= \frac{1}{n}\sum _{i=1}^n R_i(x,y) \end{aligned}$$
(9)

where n is the number of Boolean maps.
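A sketch of the post-processing and averaging steps (Eqs. 8-9) could be written as follows. The detection of surrounded regions via connected-component labels, the 3x3 dilation kernel, and the min-max normalization are implementation assumptions; scipy's `ndimage.label` is used only for convenience.

```python
from scipy import ndimage

def surrounded_regions(B):
    """Eq. 8: keep only connected regions that do not touch the frame border."""
    R = np.zeros(B.shape, dtype=np.float32)
    # Both selected (1) and non-selected (0) regions are examined, as in Sect. 2.2.
    for value in (1, 0):
        labels, n_regions = ndimage.label(B == value)
        border_labels = np.unique(np.concatenate(
            [labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]]))
        for lbl in range(1, n_regions + 1):
            if lbl not in border_labels:      # region never touches the border
                R[labels == lbl] = 1.0
    return R

def frame_saliency(boolean_maps, kernel_size=3):
    """Eq. 9: mean of the post-processed maps over the n generated Boolean maps."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    processed = []
    for B in boolean_maps:
        R = surrounded_regions(B)
        R = cv2.dilate(R, kernel)             # accentuate small surrounded areas
        if R.max() > R.min():                 # linear normalization to [0, 1]
            R = (R - R.min()) / (R.max() - R.min())
        processed.append(R)
    return np.mean(processed, axis=0)
```

A per-frame driver would then draw n pairs of motion and color Boolean maps, take their union with `master_boolean_map`, and pass the resulting list to `frame_saliency`.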

3 Experimental Results

3.1 Experiments

Our method uses Boolean maps to predict saliency in videos. In this section, we evaluate its performance by comparing the resulting saliency maps with seven state-of-the-art methods on two benchmark datasets, in terms of Precision-Recall curves, ROC curves, and Mean Absolute Error.

SegTrack v2 [14] is a video segmentation and tracking dataset. It contains 14 videos with 976 frames. The videos are diverse: some contain a single dynamic object, others more than one. Each video object has specific characteristics such as slow motion, motion blur, appearance change, complex deformation, occlusion, and interacting objects. In addition to the video frames, a binarized ground truth is provided for each frame.

The Fukuchi dataset [4] is a video saliency dataset containing 10 video sequences with a total of 936 frames and a segmented ground truth.

The PR curve plots precision against recall. To compute it, each saliency map is binarized using a fixed set of thresholds varying from 0 to 255. Precision and recall are then computed by comparing the binarized map S to the ground truth G (see Eqs. 10 and 11)

$$\begin{aligned} \mathrm {precision}=\frac{\sum \limits _{x,y} S(x,y)\, G(x,y)}{\sum \limits _{x,y} S(x,y)} \end{aligned}$$
(10)
$$\begin{aligned} \mathrm {recall}=\frac{\sum \limits _{x,y} S(x,y)\, G(x,y)}{\sum \limits _{x,y} G(x,y)} \end{aligned}$$
(11)

The receiver operating characteristic (ROC) curve plots the false positive rate against the true positive rate, again varying a fixed threshold from 0 to 255. For a better estimate of the dissimilarity between the saliency map and the ground truth, we also compute the Mean Absolute Error (MAE), Eq. 12

$$\begin{aligned} MAE = \frac{1}{N}\sum \limits _{x,y}|S(x,y)-G(x,y)| \end{aligned}$$
(12)

where S and G are the saliency map and the ground truth, and N is the number of pixels in the video frame.
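For completeness, a minimal sketch of the evaluation metrics (Eqs. 10-12 and the ROC points) is shown below, assuming S is a saliency map scaled to [0, 255] and G a binary ground-truth mask; the small epsilon added to the denominators is a guard against empty maps and is an assumption of this sketch.

```python
def precision_recall(S, G, threshold):
    """Eqs. 10-11 for a saliency map binarized at a given threshold."""
    S_bin = (S >= threshold).astype(np.float64)
    overlap = (S_bin * G).sum()
    precision = overlap / (S_bin.sum() + 1e-8)
    recall = overlap / (G.sum() + 1e-8)
    return precision, recall

def roc_point(S, G, threshold):
    """False and true positive rates for one point of the ROC curve."""
    S_bin = (S >= threshold)
    G_pos = (G > 0)
    tpr = (S_bin & G_pos).sum() / (G_pos.sum() + 1e-8)
    fpr = (S_bin & ~G_pos).sum() / ((~G_pos).sum() + 1e-8)
    return fpr, tpr

def mean_absolute_error(S, G):
    """Eq. 12: MAE between the normalized saliency map and the ground truth."""
    return np.abs(S / 255.0 - G).mean()

# PR curve: sweep the binarization threshold from 0 to 255.
# pr_curve = [precision_recall(S, G, t) for t in range(256)]
```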

3.2 Results

Precision-Recall curves over the benchmark datasets are reported in Fig. 1. They provide an efficient comparison of how correctly the salient regions of the video frames are predicted. These curves show that our proposed method outperforms the other methods. When varying the fixed threshold from 0 to 255, the precision and recall values change. When the threshold reaches 255, the recall values of [6, 9, 16] drop to 0 because their predicted salient objects do not highlight the right salient object. Our proposed method keeps a minimum recall value different from zero because our saliency maps point out the object of interest with a strong response. Furthermore, our method offers more precise saliency maps, since it achieves the best precision rate (over 0.75).

Figure 2 presents our ROC curves against those of the state-of-the-art methods over the two evaluation datasets. On the SegTrack v2 dataset our curve is competitive with the BMS [25] curve. On the Fukuchi dataset, our curve has a shape similar to that of BMS [25] at the beginning and at the end.

On the SegTrack v2 and Fukuchi datasets we outperform all other approaches by a large margin in terms of MAE values (see Fig. 3).

Fig. 1. Precision-Recall curves on Fukuchi and SegTrack v2 datasets

Fig. 2. ROC curves on Fukuchi and SegTrack v2 datasets

Fig. 3. Mean Absolute Error on Fukuchi and SegTrack v2 datasets

Fig. 4. Visual comparison of saliency maps generated by our method and six state-of-the-art methods: GVS [24], GB [6], RR [16], RT [19], ITTI [9], and BMS [25]

A visual comparison between our proposed method and the state-of-the-art methods, where higher saliency predictions are indicated by brighter pixels, is reported in Fig. 4.

GVS [24] uses the spatial and temporal edges of each dynamic object in the video frame to compute saliency maps. With a static camera and a single moving object, this method converges to the exact salient object, which explains its good results on the Fukuchi dataset and its competitive results on the SegTrack v2 dataset, which includes video frames captured under varied conditions (as explained above). The graph-based method [6] does not include motion cues in saliency map generation, which leads to poor PR and ROC curves, high MAE values, and poor saliency maps. While it can be used for saliency detection in videos, this method is more suitable for still image saliency. The method of [16] defines the region of interest as the region where the moving object is located and uses the optical flow to detect moving objects. Its PR and ROC curves are not as good as ours because optical flow without smoothing is useful only for videos with a static camera.

For the Fukuchi and SegTrack v2 datasets, camera motion estimation or optical flow smoothing should be added to improve the saliency maps. The statistical framework proposed in [19] includes motion features to segment the salient object from its background. In [9], the salient object is characterized by surprising events, and besides motion, color, intensity, orientation, and flicker features are extracted to produce the final saliency map. The saliency maps produced by [9, 19] are not good indicators of the salient object because they use spatial and motion features together, so a static pixel belonging to the background can be marked as salient.

BMS [25], which uses the Boolean map theory to predict saliency in still images, does not provide good results in some video frames (e.g., the last two rows of Fig. 4) where the colors of the background and of the moving object are almost the same. Our Boolean map based method addresses this issue: we suppose that not only moving objects attract attention, but that a change in color can also be important for saliency detection. Eq. 7 introduces our global Boolean map, which is the union of the motion-based and color-based Boolean maps.

4 Conclusion

In this paper, we presented a video saliency detection method based on the Boolean map theory of visual attention. The proposed method combines color and motion cues to produce a set of Boolean maps for each video frame. The motion cue is generated from the optical flow and smoothed using our proposed motion feature. Each Boolean map is then processed to highlight only the surrounded regions, which are considered salient. The final saliency map is the mean of all processed Boolean maps. We evaluated the performance of our method on two benchmark datasets against seven state-of-the-art methods and showed that Boolean map based video saliency can be effective using color and motion cues. As future work, we plan to test the influence of other feature channels on saliency detection.