
1 Introduction

Recently, RGB-D cameras such as Microsoft's Kinect and Intel's RealSense have become affordable and easily available. In foreground segmentation based on a background subtraction strategy, depth information alleviates problems of appearance information, such as illumination changes and color camouflage, which occurs when the colors of foreground objects are very close to the background color, because depth is independent of appearance features such as color and texture. However, depth information has its own disadvantages, such as a limited measurement range and depth camouflage, the depth counterpart of color camouflage.

Depth and appearance information have been combined in order to compensate for the disadvantages of each. Gordon et al. [1] combined two binary segmentations detected independently from the depth and appearance information. Some researchers proposed multidimensional background models using the appearance and depth information. Moyà-Alcover et al. [2] proposed a multidimensional statistical background model based on appearance and depth using kernel density estimation. Fernandez-Sanchez et al. [3] combined depth and RGB information in the codebook background subtraction model. Other researchers adaptively weighted the depth and appearance cues to select the more reliable cue for foreground segmentation. Schiller and Koch [4] proposed a reliability measure based on the variance in the depth and weighted the depth and RGB cues accordingly. Camplani and Salgado [5] used the global edge-closeness probability, considering the appearance information more reliable for foreground segmentation at pixels close to edges.

In this paper, we propose a simple combination of appearance and depth information (SCAD) in order to compensate for the disadvantages of each. Our method is inspired by [1]. We compute two background likelihoods based on the appearance and depth information, each obtained with a background subtraction strategy. Subsequently, we define an energy function based on the two background likelihoods and minimize it to obtain a foreground mask. In our experiments on the SBM-RGBD 2017 dataset [6], we confirm that this simple combination exhibits satisfactory performance in indoor environments.

2 Simple Combination of Appearance and Depth

We propose a batch algorithm that uses the likelihoods of the depth and appearance backgrounds in order to compensate for problems such as color camouflage and depth camouflage. We minimize an energy function based on the two likelihoods using graph cuts [7] to obtain a foreground mask. The two likelihoods are obtained with background subtraction strategies. We use the farthest observed depth as the depth background, and the likelihood of the depth background is computed by depth-based background subtraction. The appearance background image is built from appearances with a large likelihood of the depth background in order to eliminate the appearance of foreground objects. The likelihood of the appearance background is computed by texture-based and RGB-based background subtraction. To reduce false positives owing to illumination changes, we first roughly detect foreground objects using texture-based background subtraction; subsequently, RGB-based background subtraction is performed to refine these results. The likelihood of the depth background is used to decide the pixels to which RGB-based background subtraction is applied. Moreover, we detect illumination changes using the hue, saturation, and value (HSV) color space.

2.1 Background Subtraction Using Depth

We perform background subtraction at each pixel \({\varvec{x}}=(x,y)\) and obtain the likelihood of the depth background \(p^t({\varvec{x}})\) at frame t as follows:

$$\begin{aligned} p^t({\varvec{x}})&= \frac{1.0+\exp (-k)}{1.0 + \exp (D^t_d({\varvec{x}})-k)}\end{aligned}$$
(1)
$$\begin{aligned} D^t_d({\varvec{x}})&=\frac{|B_{d}({\varvec{x}})-I^t_{d}({\varvec{x}})|}{\sigma _{d_{\varvec{x}}}}, \end{aligned}$$
(2)

where \(I^t_{d}\) is the depth image at frame t, \(B_{d}({\varvec{x}})\) is the depth background image, and \(\sigma _{d_{\varvec{x}}}\) is the deviation from \(B_{d}({\varvec{x}})\). Furthermore, k is a parameter that controls how quickly \(p^t\) decreases as \(D^t_{d}({\varvec{x}})\) increases.
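
A minimal NumPy sketch of Eqs. (1) and (2) is given below; the function name and the default value of k are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def depth_background_likelihood(depth_frame, depth_bg, sigma_d, k=2.0):
    """Eqs. (1)-(2): per-pixel likelihood of the depth background.

    depth_frame: I_d^t, depth image at frame t (H x W, float)
    depth_bg:    B_d, depth background image (H x W)
    sigma_d:     per-pixel deviation from B_d (H x W)
    k:           controls how fast the likelihood decays with D_d^t
                 (the value 2.0 is an illustrative assumption)
    """
    D = np.abs(depth_bg - depth_frame) / sigma_d      # Eq. (2)
    p = (1.0 + np.exp(-k)) / (1.0 + np.exp(D - k))    # Eq. (1): p = 1 at D = 0, -> 0 as D grows
    return p
```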

We assume that background objects do not move because indoor environments do not have dynamic backgrounds such as waving trees. Therefore, we consider the farthest depth to be the depth background and select the farthest depth value as \(B_{d}({\varvec{x}})\). In this study, we used all the target frames to obtain \(B_{d}({\varvec{x}})\). To obtain \(\sigma _{d_{\varvec{x}}}\), we select the depth values most similar to \(B_{d}({\varvec{x}})\) in order to eliminate depth values originating from foreground objects; the number of selected depth values is 25% of the number of frames. We limit the range of the deviation to avoid significantly large or small values: the lower bound is \(\max (1.0, 0.1\mu )\) and the upper bound is \(1.1\mu \), where \(\mu \) is the mean of \(\sigma _{d_{\varvec{x}}}\). Dilation is performed on the deviation image in order to smooth the deviation spatially.
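
The following sketch illustrates one plausible reading of this construction, assuming the depth frames are stacked in a NumPy array, that larger depth values are farther from the sensor, and that "deviation" means the root-mean-square deviation of the selected values from \(B_{d}\); the 3×3 dilation kernel is also an assumption.

```python
import numpy as np
from scipy import ndimage as ndi

def build_depth_background(depth_stack):
    """Farthest-depth background B_d and per-pixel deviation sigma_d.

    depth_stack: (N, H, W) float array of measured depth values
    (handling of non-measured depth is sketched separately).
    """
    N = depth_stack.shape[0]
    B_d = depth_stack.max(axis=0)            # farthest depth per pixel (larger = farther)

    # keep the 25% of depth values closest to B_d to exclude foreground depths
    diff = np.abs(depth_stack - B_d)
    n_keep = max(1, int(0.25 * N))
    idx = np.argsort(diff, axis=0)[:n_keep]
    closest = np.take_along_axis(depth_stack, idx, axis=0)
    sigma_d = np.sqrt(np.mean((closest - B_d) ** 2, axis=0))   # deviation from B_d

    # clamp the deviation to avoid extreme values, then dilate to smooth it
    mu = sigma_d.mean()
    sigma_d = np.clip(sigma_d, max(1.0, 0.1 * mu), 1.1 * mu)
    sigma_d = ndi.grey_dilation(sigma_d, size=(3, 3))          # 3x3 kernel is an assumption
    return B_d, sigma_d
```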

Depth values are not always measured at every pixel. When depth values cannot be measured, we perform background subtraction based on the status and trend of depth observation at each pixel. The status of depth observation has two states: md, which indicates a measured depth, and nmd, which indicates a non-measured depth. We classify the trend of depth observation into three classes: constant nmd, rippling nmd, and constant md. Constant md indicates that the depth value can be measured stably, whereas constant nmd indicates that it cannot. Rippling nmd indicates that the status of depth observation changes from nmd to md frequently. We count the number of instances of nmd (\(\#nmd\)) and of switches from nmd to md (\(\#switch\)) at each pixel using all the target frames. We classify the trend of depth observation as follows:

$$\begin{aligned} td({\varvec{x}}) = {\left\{ \begin{array}{ll} \hbox {constant nmd} &{} \#nmd> 0.5N \wedge \#switch < 0.1N \\ \hbox {rippling nmd} &{} \#switch > 0.1N \\ \hbox {constant md} &{} \hbox {otherwise} \end{array}\right. }, \end{aligned}$$
(3)

where N is the number of frames.
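
A sketch of the trend classification in Eq. (3), assuming a boolean validity stack marking measured-depth pixels; the integer label encoding is an illustrative choice.

```python
import numpy as np

def classify_depth_trend(valid_stack):
    """Eq. (3): classify the trend of depth observation per pixel.

    valid_stack: boolean (N, H, W) array, True where a depth value was measured (md).
    Returns an integer map: 0 = constant nmd, 1 = rippling nmd, 2 = constant md.
    """
    N = valid_stack.shape[0]
    n_nmd = (~valid_stack).sum(axis=0)                              # #nmd
    n_switch = ((~valid_stack[:-1]) & valid_stack[1:]).sum(axis=0)  # #switch: nmd -> md

    trend = np.full(n_nmd.shape, 2, dtype=np.uint8)                 # constant md (otherwise)
    trend[n_switch > 0.1 * N] = 1                                   # rippling nmd
    trend[(n_nmd > 0.5 * N) & (n_switch < 0.1 * N)] = 0             # constant nmd
    return trend
```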

We modify the likelihood of the depth background \(p^t({\varvec{x}})\) as \(p^t_d({\varvec{x}})\) based on the status and trend of depth observation.

$$\begin{aligned} p^t_d({\varvec{x}}) = {\left\{ \begin{array}{ll} p^t({\varvec{x}}), &{} I^t_d({\varvec{x}})=\hbox {md}~ \wedge \\ &{} td({\varvec{x}})= \hbox {rippling nmd or constant md} \\ 1.0, &{} I_d^t({\varvec{x}})=\hbox {nmd}~ \wedge \\ &{} td({\varvec{x}})= \hbox {rippling nmd or constant nmd} \\ \displaystyle \frac{1}{2dt+1}\sum _{i=t-dt}^{t+dt} p^i({\varvec{x}}), &{} \hbox {otherwise} \end{array}\right. }, \end{aligned}$$
(4)

where dt is the range of neighboring frames used for calculating the average of \(p^t\); we used \(dt=2\) in this study. We consider that a pixel \({\varvec{x}}\) with the rippling nmd trend has two background depth values: \(B_{d}({\varvec{x}})\) and the non-measured depth value. Therefore, when \(td({\varvec{x}})\) is rippling nmd, we compute \(p^t({\varvec{x}})\) in the case of md and return 1.0 in the case of nmd. When the status of depth observation is md and \(td({\varvec{x}})\) is constant nmd, we average \(p^t({\varvec{x}})\) over the neighboring frames because we cannot determine whether the depth value originates from foreground objects or from sudden depth noise; in this case, we set \(p^t({\varvec{x}})\) to 0.5 at frame t. When the status of depth observation is nmd and \(td({\varvec{x}})\) is constant md, we also average \(p^t\) to obtain \(p^t_d({\varvec{x}})\).
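
A sketch of Eq. (4), assuming the per-frame likelihoods \(p^t\) have already been computed and stored with 0.5 at non-measured pixels, as described above; all names are illustrative.

```python
import numpy as np

def modified_depth_likelihood(p_stack, valid_stack, trend, t, dt=2):
    """Eq. (4): depth-background likelihood p_d^t at frame t.

    p_stack:     (N, H, W) array of p^t from Eqs. (1)-(2), assumed to hold 0.5
                 where no depth was measured (see the text)
    valid_stack: (N, H, W) boolean array, True where depth is measured (md)
    trend:       map from Eq. (3): 0 = constant nmd, 1 = rippling nmd, 2 = constant md
    """
    md = valid_stack[t]
    lo, hi = max(0, t - dt), min(p_stack.shape[0], t + dt + 1)
    p_avg = p_stack[lo:hi].mean(axis=0)                 # temporal average for "otherwise"

    p_d = p_avg.copy()                                  # "otherwise" branch
    keep = md & ((trend == 1) | (trend == 2))           # md with rippling nmd or constant md
    p_d[keep] = p_stack[t][keep]
    one = (~md) & ((trend == 1) | (trend == 0))         # nmd with rippling nmd or constant nmd
    p_d[one] = 1.0
    return p_d
```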

2.2 Background Subtraction Using Appearance

Appearance information is effective for reducing false negatives caused by the depth camouflage problem. However, appearance-based background subtraction mainly suffers from illumination changes in indoor environments. Many researchers have proposed appearance-based background subtraction strategies that are robust to illumination changes [8]. In our method, we use two approaches to handle illumination changes.

First, we detect foreground objects using texture-based background subtraction in order to reduce false positives caused by global illumination changes. We use the scale invariant local ternary pattern (SILTP) [9] as the texture feature and the visual background extractor (ViBe) [10] as the background subtraction strategy. In the foreground detection of ViBe, we use the Hamming distance instead of the L2 distance. For the initialization of ViBe, we build an RGB background image \(B_{a}({\varvec{x}})\), using \(p^t_d({\varvec{x}})\) in order to eliminate RGB values originating from foreground objects, as follows:

$$\begin{aligned} B_{a}({\varvec{x}})&= \frac{1}{\sum _{t \in \{i| p^i_d({\varvec{x}})> 0.75\}}p^t_d({\varvec{x}})}\sum _{t \in \{i| p^i_d({\varvec{x}}) > 0.75\}}p^t_d({\varvec{x}})I^t_a({\varvec{x}}), \end{aligned}$$
(5)

where \(I^t_{a}\) is the RGB image at frame t.
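
A sketch of Eq. (5): a \(p^t_d\)-weighted average of the RGB frames, restricted to frames where the depth-background likelihood exceeds 0.75; array names are illustrative.

```python
import numpy as np

def build_appearance_background(rgb_stack, p_d_stack, th=0.75):
    """Eq. (5): RGB background B_a.

    rgb_stack: (N, H, W, 3) float array of RGB frames I_a^t
    p_d_stack: (N, H, W) array of depth-background likelihoods p_d^t
    """
    w = np.where(p_d_stack > th, p_d_stack, 0.0)       # zero weight where p_d <= 0.75
    num = (w[..., None] * rgb_stack).sum(axis=0)
    den = w.sum(axis=0)[..., None]
    return num / np.maximum(den, 1e-6)                 # guard against division by zero
```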

Second, we perform RGB-based background subtraction around the pixels detected by texture-based background subtraction, because the results of texture-based background subtraction are mainly foreground boundaries, as shown in Fig. 1a. However, the final results may suffer from false positives from illumination changes if we uniformly apply RGB-based background subtraction to the neighboring pixels. We therefore select the neighboring pixels to which RGB-based background subtraction is applied based on the likelihood of the depth background \(p^t_d\). Figure 1b shows the decision procedure. We compare \(p^t_d({\varvec{x}})\) at a pixel \({\varvec{x}}\) with \(p^t_d({\varvec{x}}+\delta )\) and \(p^t_d({\varvec{x}}-\delta )\) at the two neighboring pixels, as shown in Fig. 1b. We consider that there are gradient descents of \(p^t_d\) at the boundaries between foreground objects and background regions. If \(p^t_d({\varvec{x}}) - p^t_d({\varvec{x}}+\delta )> th_g \wedge |p^t_d({\varvec{x}}) - p^t_d({\varvec{x}}-\delta )| < th_g\) or \(p^t_d({\varvec{x}}-\delta ) - p^t_d({\varvec{x}}) > th_g \wedge |p^t_d({\varvec{x}}) - p^t_d({\varvec{x}}+\delta )| < th_g\) is satisfied, we perform RGB-based background subtraction for the \(5 \times 5\) neighboring pixels at \({\varvec{x}}+\delta \). We also perform RGB-based background subtraction for \({\varvec{x}}+i\delta \) while \(|p^t_d({\varvec{x}}+i\delta ) - p^t_d({\varvec{x}}+\delta )| < th_g\), up to \(i = 4\), because regions where the likelihood of the depth background is similar to \(p^t_d({\varvec{x}}+\delta )\) are also likely to belong to foreground objects. In this study, we compare the eight neighboring pixels shown in Fig. 1b.
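
A sketch of one possible implementation of this selection step, assuming eight unit offsets for \(\delta\) and marking a 5×5 neighborhood around every accepted pixel; the threshold value \(th_g\) and the decision to mark 5×5 blocks along the whole run \({\varvec{x}}+i\delta\) are assumptions for illustration.

```python
import numpy as np

def select_rgb_check_pixels(texture_fg, p_d, th_g=0.1, max_step=4):
    """Select pixels around texture-based detections for RGB-based
    background subtraction, using the gradient check on p_d described above."""
    H, W = p_d.shape
    target = np.zeros((H, W), dtype=bool)
    directions = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]       # the eight offsets delta

    def mark_5x5(y, x):
        target[max(0, y - 2):y + 3, max(0, x - 2):x + 3] = True

    ys, xs = np.nonzero(texture_fg)                      # texture-based detections
    for y, x in zip(ys, xs):
        for dy, dx in directions:
            yp, xp = y + dy, x + dx                      # x + delta
            ym, xm = y - dy, x - dx                      # x - delta
            if not (0 <= yp < H and 0 <= xp < W and 0 <= ym < H and 0 <= xm < W):
                continue
            descent = (p_d[y, x] - p_d[yp, xp] > th_g and
                       abs(p_d[y, x] - p_d[ym, xm]) < th_g) or \
                      (p_d[ym, xm] - p_d[y, x] > th_g and
                       abs(p_d[y, x] - p_d[yp, xp]) < th_g)
            if not descent:
                continue
            mark_5x5(yp, xp)
            # follow the direction while p_d stays similar to p_d(x + delta), up to i = 4
            for i in range(2, max_step + 1):
                yi, xi = y + i * dy, x + i * dx
                if not (0 <= yi < H and 0 <= xi < W):
                    break
                if abs(p_d[yi, xi] - p_d[yp, xp]) >= th_g:
                    break
                mark_5x5(yi, xi)
    return target
```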

We compute RGB-based background subtraction as follows:

$$\begin{aligned} p^t_a({\varvec{x}})&= \frac{1.0+\exp (-k)}{1.0 + \exp (D^t_a({\varvec{x}})-k)}ic({\varvec{x}}) + (1.0-ic({\varvec{x}}))\end{aligned}$$
(6)
$$\begin{aligned} D^t_a({\varvec{x}})&= \frac{|B_{a}({\varvec{x}})-I^t_{a}({\varvec{x}})|^2}{\sigma _{a}}, \end{aligned}$$
(7)

where \(\sigma _{a}\) is a parameter that controls the similarity between \(B_{a}({\varvec{x}})\) and \(I^t_{a}({\varvec{x}})\), and \(ic({\varvec{x}})\) is the penalty term for illumination changes.
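A sketch of Eqs. (6) and (7), taking the illumination-change penalty \(ic\) of Eq. (8) as an input; the default values of \(\sigma_a\) and k are illustrative assumptions.

```python
import numpy as np

def appearance_background_likelihood(rgb_frame, rgb_bg, ic, sigma_a=400.0, k=2.0):
    """Eqs. (6)-(7): appearance-background likelihood p_a^t.

    rgb_frame: I_a^t (H, W, 3) float;  rgb_bg: B_a (H, W, 3) float
    ic:        illumination-change penalty from Eq. (8), 0.0 or 1.0 per pixel
    """
    D = np.sum((rgb_bg - rgb_frame) ** 2, axis=-1) / sigma_a    # Eq. (7)
    p = (1.0 + np.exp(-k)) / (1.0 + np.exp(D - k))              # same sigmoid as Eq. (1)
    return p * ic + (1.0 - ic)                                  # Eq. (6): forced to 1 where ic = 0
```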

Fig. 1. Appearance-based background subtraction. In (a), texture-based background subtraction detects foreground boundaries. (b) illustrates \(p^t_d\) of the upper-right region of the object in (a); \(p^t_d\) has larger values in darker regions. White pixels indicate the result of texture-based background subtraction, and RGB-based background subtraction is performed in the yellow solid/dotted circles. The white arrow indicates the direction of gradient descents of \(p^t_d\). We consider that there are gradient descents of \(p^t_d\) at the boundaries between foreground objects and background regions; therefore, we perform RGB-based background subtraction in this direction. (c) shows the result of RGB-based background subtraction. (Color figure online)

The texture-based approach is robust to global illumination changes; however, this approach often causes false positives in the case of local illumination changes such as shadows. Similar to [11], we perform the detection of illumination changes in the HSV color space.

$$\begin{aligned} ic({\varvec{x}}) = {\left\{ \begin{array}{ll} 0.0 &{} (\cos (B_H({\varvec{x}})-I_H({\varvec{x}})) > th_H \wedge |B_S({\varvec{x}})-I_S({\varvec{x}})| < th_S \\ &{} \wedge ~ B_V({\varvec{x}})I_V({\varvec{x}}) > 0.1 ) ~\vee \\ &{} (B_S({\varvec{x}})+ I_S({\varvec{x}}) < th_S \wedge |B_V({\varvec{x}})-I_V({\varvec{x}})| < th_V) \\ 1.0 &{} \hbox {otherwise} \end{array}\right. }, \end{aligned}$$
(8)

where \(I_{H,S,V}\) and \(B_{H,S,V}\) indicate the hue, saturation, and value (brightness) of \(I^t_{a}\) and \(B_{a}\) in the HSV color space, respectively, and \(th_{H,S,V}\) are user-chosen thresholds. The condition has two parts: the former checks the similarity of hue and saturation between \(I^t_{a}\) and \(B_{a}\) in sufficiently lit regions, whereas the latter checks the similarity of brightness between \(I^t_{a}\) and \(B_{a}\) in dark regions. The hue condition is omitted in the latter part because hue is not stable in dark regions.
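
A sketch of Eq. (8), assuming hue is given in radians and the threshold values are placeholders (the actual values are listed in Table 1).

```python
import numpy as np

def illumination_change_penalty(bg_hsv, frame_hsv, th_H=0.95, th_S=0.2, th_V=0.2):
    """Eq. (8): ic(x) = 0 where the change is explained by illumination, 1 otherwise.

    bg_hsv, frame_hsv: (H, W, 3) arrays of (hue in radians, saturation, value)
    computed from B_a and I_a^t; the threshold defaults are illustrative.
    """
    B_H, B_S, B_V = bg_hsv[..., 0], bg_hsv[..., 1], bg_hsv[..., 2]
    I_H, I_S, I_V = frame_hsv[..., 0], frame_hsv[..., 1], frame_hsv[..., 2]

    bright = (np.cos(B_H - I_H) > th_H) & (np.abs(B_S - I_S) < th_S) & (B_V * I_V > 0.1)
    dark = (B_S + I_S < th_S) & (np.abs(B_V - I_V) < th_V)   # hue is ignored in dark regions
    return np.where(bright | dark, 0.0, 1.0)
```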

2.3 Combination of Depth and Appearance Information

We combine \(p^t_d({\varvec{x}})\) and \(p^t_a({\varvec{x}})\) using a graph-based approach for obtaining a foreground mask \(L^t\) at frame t. We define the energy function as follows:

$$\begin{aligned} E(L^t)&= \sum _{{\varvec{x}}}f(L^t({\varvec{x}}))+ \alpha \sum _{({\varvec{x}}_i,{\varvec{x}}_j)\in \xi }g(L^t({\varvec{x}}_i),L^t({\varvec{x}}_j)), \end{aligned}$$
(9)

where \(L^t({\varvec{x}}) = \{1.0 \equiv FG,0.0 \equiv BG\}\) and \(\xi \) is a set of connected pixel pairs in an eight-connected 2D grid graph; \(f(L^t({\varvec{x}}))\) evaluates the likelihood of the foreground and background; \(g(L^t({\varvec{x}}_i),L^t({\varvec{x}}_j))\) represents the relationship between neighboring pixels using the depth and the appearance information; \(\alpha \) is a parameter defined by the user. We minimize \(E(L^t)\) using graph cuts [7] in order to obtain the optimal \(L^t\).

We compute \(f(L^t({\varvec{x}}))\) as follows:

$$\begin{aligned} f(L^t({\varvec{x}}))&= {\left\{ \begin{array}{ll} fg &{} L^t({\varvec{x}}) = 0.0 \\ 1.0 -fg &{} L^t({\varvec{x}}) = 1.0 \\ \end{array}\right. }\end{aligned}$$
(10)
$$\begin{aligned} fg&= \frac{2.0}{1.0 + \exp (-\sigma _f(1.0-p^t_d({\varvec{x}})+ 1.0-p^t_a({\varvec{x}})))} - 1.0, \end{aligned}$$
(11)

where \(\sigma _f\) emphasizes the likelihood of the foreground. Figure 2 shows \(p^t_d\), \(p^t_a\), and fg. In Fig. 2d, the likelihood that the man's arm belongs to the foreground is enhanced owing to \(p^t_a\).
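
A sketch of the data term in Eqs. (10) and (11); the default value of \(\sigma_f\) is an illustrative assumption.

```python
import numpy as np

def unary_terms(p_d, p_a, sigma_f=2.0):
    """Eqs. (10)-(11): data term f of the energy in Eq. (9).

    Returns (f for L = 0, f for L = 1), i.e. the costs of labeling a pixel
    background and foreground, respectively.
    """
    fg = 2.0 / (1.0 + np.exp(-sigma_f * ((1.0 - p_d) + (1.0 - p_a)))) - 1.0  # Eq. (11)
    return fg, 1.0 - fg                                                      # Eq. (10)
```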

We compute \(g(L^t({\varvec{x}}_i),L^t({\varvec{x}}_j))\) as follows:

$$\begin{aligned} g(L^t({\varvec{x}}_i),L^t({\varvec{x}}_j))&= {\left\{ \begin{array}{ll} g_d + g_a &{} L^t({\varvec{x}}_i) \ne L^t({\varvec{x}}_j) \\ 0 &{} L^t({\varvec{x}}_i) = L^t({\varvec{x}}_j) \end{array}\right. }\end{aligned}$$
(12)
$$\begin{aligned} g_d&= \exp (-\frac{|I^t_d({\varvec{x}}_i) - I^t_d({\varvec{x}}_j)|^2}{\sigma _{ds}^2}) \end{aligned}$$
(13)
$$\begin{aligned} g_a&= \exp (-\frac{|I^t_a({\varvec{x}}_i) - I^t_a({\varvec{x}}_j)|^2}{\sigma _{as}^2}), \end{aligned}$$
(14)

where \(\sigma _{ds}\) and \(\sigma _{as}\) are parameters that control the similarity of the depth and RGB values. If neighboring pixels have similar depth and RGB values, the labels of the neighboring pixels are likely to be the same owing to the term g, which reduces false positives using the spatial similarity of the depth and appearance information.
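
A sketch of the full minimization of Eq. (9) under stated assumptions: it uses the PyMaxflow library rather than the authors' implementation, a 4-connected grid for brevity (the paper uses an 8-connected grid), and placeholder parameter values.

```python
import numpy as np
import maxflow  # PyMaxflow; its use here is an implementation assumption

def segment(p_d, p_a, depth, rgb, alpha=1.0, sigma_f=2.0, sigma_ds=10.0, sigma_as=30.0):
    """Minimize Eq. (9) with an s-t min cut to obtain the foreground mask L^t.

    p_d, p_a: likelihood maps (H, W); depth, rgb: float arrays (H, W) and (H, W, 3).
    """
    H, W = p_d.shape
    fg = 2.0 / (1.0 + np.exp(-sigma_f * ((1.0 - p_d) + (1.0 - p_a)))) - 1.0  # Eq. (11)
    cost_bg, cost_fg = fg, 1.0 - fg                                          # Eq. (10)

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((H, W))
    # t-links: the source capacity is paid when a pixel is labeled foreground,
    # the sink capacity when it is labeled background
    g.add_grid_tedges(nodes, cost_fg, cost_bg)

    def pair_weight(i0, j0, i1, j1):                                         # Eqs. (12)-(14)
        g_d = np.exp(-(depth[i0, j0] - depth[i1, j1]) ** 2 / sigma_ds ** 2)
        g_a = np.exp(-np.sum((rgb[i0, j0] - rgb[i1, j1]) ** 2) / sigma_as ** 2)
        return alpha * (g_d + g_a)

    for i in range(H):                      # n-links between 4-connected neighbors
        for j in range(W):
            if j + 1 < W:
                w = pair_weight(i, j, i, j + 1)
                g.add_edge(nodes[i, j], nodes[i, j + 1], w, w)
            if i + 1 < H:
                w = pair_weight(i, j, i + 1, j)
                g.add_edge(nodes[i, j], nodes[i + 1, j], w, w)

    g.maxflow()
    return g.get_grid_segments(nodes)       # True = foreground (sink side)
```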

Fig. 2. Combination of \(p^t_d\) and \(p^t_a\). Depth-based background subtraction suffers from depth camouflage in the man's arm, as shown in (b). In (d), the likelihood that the man is in the foreground is enhanced owing to the combination of \(p^t_d\) and \(p^t_a\).

3 Dataset and Parameter Settings

We used an open access dataset provided by SBM-RGBD 2017 [6]. The following seven video categories are available.

1. Illumination Changes (4 videos) owing to light switches or automatic camera brightness changes.
2. Color Camouflage (4 videos) containing foreground objects with color very close to that of the background.
3. Depth Camouflage (4 videos) containing foreground objects with depth very close to that of the background.
4. Intermittent Motion (6 videos) with foreground objects which cause “ghosting” artifacts in the detected motion.
5. Out of Sensor Range (5 videos) with non-measured depth regions resulting from being too close to or far from the sensor.
6. Shadows (5 videos) caused by foreground objects.
7. Bootstrapping (5 videos) with foreground objects in all the frames.

We evaluated the performance of our method using seven metrics: recall, specificity, false positive rate (FPR), false negative rate (FNR), percentage of wrong classifications (PWC), precision, and FMeasure. We used the parameters described in Table 1. These parameters were fixed for all the scenes. For further details of the parameters for SILTP and ViBe, please refer to [9, 10].

Our method was implemented in C++ and executed using a single thread. The performance measurements were carried out on a 3.70 GHz Intel Xeon processor with 32.0 GB of main memory. The average processing speed was 1.95 fps. Notably, our method is a batch algorithm.

Table 1. Parameter settings.

4 Experimental Results

Table 2 presents the results of the quantitative evaluation. We confirmed that our method (SCAD) exhibits satisfactory performance in indoor environments. Notably, FMeasure, precision, and recall for Illumination Changes are lower because two videos of this category do not contain foreground objects.

Table 2. Averaged quantitative evaluation results.
Fig. 3. Results of identification of foreground and background using SCAD for Color Camouflage, Depth Camouflage, and Out Of Range.

Fig. 4. Results of identification of foreground and background using SCAD for Bootstrapping, Illumination Changes, Intermittent Motion, and Shadows.

Figures 3 and 4 show examples of the input images, the results of SCAD, \(p^t_d\), and \(p^t_a\). We observed that SCAD detected foreground objects in Depth Camouflage and Color Camouflage, as shown in Fig. 3. In Fig. 3, the appearance-based background subtraction is not effective for colorCam2 according to \(p^t_a\), and the depth-based background subtraction is not effective for DCamSeq2 according to \(p^t_d\). However, SCAD reduces false negatives by combining \(p^t_d\) and \(p^t_a\). Moreover, SCAD suppressed false positives caused by noise in \(p^t_d\) by using graph cuts.

In the Bootstrapping category, we observed that SCAD did not detect parts of the foreground objects, as shown in Fig. 4. SCAD failed to build background images that exclude these foreground objects because the objects stay at the same position in all the frames.

SCAD also has a drawback stemming from its combination of depth and appearance information. In the Illumination Changes category, we observed some false positives, which were caused by false detections of the texture-based background subtraction when the room became darker and the observed RGB images became noisy; the texture-based background subtraction does not function well here because it detects the noise as local changes. Moreover, SCAD classified weak shadows as background; however, strong shadows near the foreground objects were detected as foreground, as shown in shadows1 of Fig. 4, because the texture-based background subtraction detects the boundaries of strong shadows. In these categories, the false positives in the final results of SCAD depend on the errors of the texture-based background subtraction.

5 Conclusion

In this paper, we proposed a simple combination of appearance and depth information (SCAD) for foreground segmentation. We compute two background likelihoods using background subtraction based on the appearance and depth information. Subsequently, we minimize an energy function based on the two background likelihoods using graph cuts in order to obtain foreground masks. We evaluated SCAD on the SBM-RGBD 2017 dataset and confirmed that it is effective for indoor environments. In future work, we will convert SCAD into an online method; we expect this conversion to be straightforward using a sequential updating strategy.