Abstract
In foreground segmentation, depth information is robust to problems that affect appearance information, such as illumination changes and color camouflage; however, depth is not always measurable and suffers from depth camouflage. To compensate for the disadvantages of the two pieces of information, we define an energy function based on the likelihoods of the depth and appearance backgrounds and minimize it using graph cuts to obtain a foreground mask. The two likelihoods are obtained using background subtraction. We use the farthest observed depth as the depth background. The appearance background is defined as the appearance with a large likelihood of the depth background, which eliminates the appearances of foreground objects. When computing the likelihood of the appearance background, we also use the likelihood of the depth background to reduce false positives caused by illumination changes. In our experiment on the SBM-RGBD 2017 dataset, we confirm that our method is sufficiently accurate for indoor environments.
1 Introduction
Recently, RGB-D cameras such as Microsoft's Kinect and Intel's RealSense have become affordable and easily available. In foreground segmentation based on a background subtraction strategy, depth information mitigates problems of appearance information, such as illumination changes and color camouflage (caused when the colors of foreground objects are very close to the background colors), because depth is independent of appearance features such as color and texture. However, depth information has its own disadvantages, such as a limited measurement range and depth camouflage, the depth analog of color camouflage.
The depth and appearance information have been combined to compensate for the disadvantages of the two pieces of information. Gordon et al. [1] combined two binary segmentations detected independently from the depth and appearance information. Some researchers proposed multidimensional background models using the appearance and depth information: Moyà-Alcover et al. [2] proposed a multidimensional statistical background model based on kernel density estimation, and Fernandez-Sanchez et al. [3] combined depth and RGB information in the codebook background subtraction model. Other researchers adaptively weighted the depth and appearance cues to select the more reliable cue for foreground segmentation: Schiller and Koch [4] weighted the depth and RGB cues by a reliability measure based on the variance of the depth, and Camplani and Salgado [5] used a global edge-closeness probability, considering the appearance information more reliable at pixels close to edges.
In this paper, we propose a simple combination of the appearance and depth information (SCAD) in order to compensate for the disadvantages of the two pieces of information. Our method is inspired by [1]. We compute the two likelihoods of the background based on the appearance and depth information. Each background likelihood is computed based on a background subtraction strategy. Subsequently, we define an energy function based on the two likelihoods of the background and minimize the energy to obtain a foreground mask. In our experiment, we confirm that our simple combination exhibits satisfactory performance for indoor environments using the SBM-RGBD 2017 dataset [6].
2 Simple Combination of Appearance and Depth
We propose a batch algorithm using two likelihoods of the depth and appearance backgrounds in order to compensate for problems such as color camouflage and depth camouflage. We minimize an energy function based on the two likelihoods by using graph cuts [7] in order to obtain a foreground mask. The two likelihoods are obtained based on background subtraction strategies. We use the farthest observed depth as the depth background. The likelihood of the depth background is computed based on depth-based background subtraction. The appearance background image is defined as the appearance with a large likelihood of the depth background in order to eliminate the appearances of foreground objects. The likelihood of the appearance background is computed based on texture-based and RGB-based background subtraction. In order to reduce false positives caused by illumination changes, we roughly detect foreground objects by using texture-based background subtraction. Subsequently, RGB-based background subtraction is performed in order to improve the results of the texture-based background subtraction. The likelihood of the depth background is used to decide the pixels to which RGB-based background subtraction is applied. Moreover, we detect changes in illumination by using the hue, saturation, and value (HSV) color space.
2.1 Background Subtraction Using Depth
We perform background subtraction at each pixel \({\varvec{x}}=(x,y)\) and obtain the likelihood of the depth background \(p^t({\varvec{x}})\) at frame t as follows:
where \(I^t_{d}\) is the depth image at frame t, \(B_{d}({\varvec{x}})\) is the depth background image, and \(\sigma _{d_{\varvec{x}}}\) is the deviation from \(B_{d}({\varvec{x}})\). Furthermore, k is a parameter that controls how strongly \(p^t\) responds to the depth difference \(D^t_{d}({\varvec{x}})\).
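As the defining equation is not reproduced here, the following is only a hedged sketch of this step. It assumes a Gaussian-style decay of the background likelihood with the normalized depth difference \(D^t_{d}({\varvec{x}}) = |I^t_{d}({\varvec{x}}) - B_{d}({\varvec{x}})|\), with k widening the tolerance band; the specific functional form is an assumption of this sketch, not the paper's definition.

```python
import numpy as np

def depth_background_likelihood(I_d, B_d, sigma_d, k=2.0):
    # Likelihood of the depth background p^t(x): close to 1 when the
    # observed depth matches B_d(x), decaying as the normalized depth
    # difference D^t_d(x) = |I^t_d(x) - B_d(x)| grows.  The Gaussian
    # shape and the role of k are assumptions made for illustration.
    D = np.abs(np.asarray(I_d, dtype=np.float64) - B_d)
    return np.exp(-(D / (k * sigma_d)) ** 2)
```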
We assume that background objects do not move because indoor environments do not have dynamic backgrounds such as waving trees. Therefore, we consider the farthest depth as the depth background and select the farthest depth value as \(B_{d}({\varvec{x}})\). In this study, we used all the target frames for obtaining \(B_{d}({\varvec{x}})\). In order to obtain \(\sigma _{d_{\varvec{x}}}\), we select depth values similar to \(B_{d}({\varvec{x}})\), which eliminates depth values belonging to foreground objects. The number of selected depth values is 25% of the number of frames. We limit the range of the deviation to avoid excessively large or small values: the lower bound is \(\max (1.0, 0.1\mu )\) and the upper bound is \(1.1\mu \), where \(\mu \) is the mean of \(\sigma _{d_{\varvec{x}}}\). Dilation is performed on the deviation image in order to smooth the deviation.
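The background model just described can be sketched as follows. The use of the root-mean-square deviation from \(B_{d}\), the handling of a degenerate \(\mu\), and the omission of the final dilation step are choices of this sketch rather than details taken from the paper.

```python
import numpy as np

def depth_background_model(depth_stack):
    """Farthest-depth background B_d(x) and per-pixel deviation sigma_d(x).

    depth_stack: (N, H, W) array of measured depth frames.  The deviation
    is estimated from the 25% of values closest to B_d(x), which excludes
    foreground depths, and is then clamped to [max(1.0, 0.1*mu), 1.1*mu].
    """
    N = depth_stack.shape[0]
    B_d = depth_stack.max(axis=0)                 # farthest depth per pixel
    diff = np.abs(depth_stack - B_d)              # distance of samples to B_d
    n_keep = max(1, N // 4)                       # 25% of the frames
    nearest = np.sort(diff, axis=0)[:n_keep]      # values most similar to B_d
    sigma = np.sqrt((nearest ** 2).mean(axis=0))  # RMS deviation (assumed form)
    mu = sigma.mean()
    lo = max(1.0, 0.1 * mu)
    hi = max(1.1 * mu, lo)                        # guard for very small mu
    return B_d, np.clip(sigma, lo, hi)
```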
The depth values are not always measured at each pixel. When we cannot measure depth values, we perform background subtraction based on the status and trend of depth observation at each pixel. The status of depth observation has two states: md, which indicates measured depth, and nmd, which indicates non-measured depth. We classify the trend of depth observation into three classes: constant nmd, rippling nmd, and constant md. Constant md indicates that we can measure the depth value stably, whereas constant nmd indicates that we cannot. Rippling nmd indicates that the status of depth observation changes between nmd and md frequently. We count the number of instances of nmd (\(\#nmd\)) and of switches from nmd to md (\(\#switch\)) at each pixel using all the target frames. We classify the trend of depth observation as follows:
where N is the number of frames.
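This classification can be sketched as follows. The threshold ratios are illustrative assumptions; the paper defines the exact conditions in terms of \(\#nmd\), \(\#switch\), and N.

```python
def classify_trend(n_nmd, n_switch, N, r_nmd=0.9, r_switch=0.1):
    # Classify the trend of depth observation at one pixel from the count
    # of non-measured frames (#nmd) and of nmd -> md switches (#switch)
    # over N frames.  The ratios r_nmd and r_switch are assumed values.
    if n_nmd >= r_nmd * N:
        return "constant nmd"   # depth almost never measured
    if n_switch >= r_switch * N:
        return "rippling nmd"   # status flips between nmd and md often
    return "constant md"        # depth measured stably
```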
We modify the likelihood of the depth background \(p^t({\varvec{x}})\) as \(p^t_d({\varvec{x}})\) based on the status and trend of depth observation.
where dt is the range of neighboring frames used for calculating the average of \(p^t\). We used \(dt=2\) in this study. We consider that a pixel \({\varvec{x}}\) with the trend of rippling nmd has two background depth values: \(B_{d}({\varvec{x}})\) and the non-measured depth value. Therefore, when \(td({\varvec{x}})\) is rippling nmd, we compute \(p^t({\varvec{x}})\) in the case of md and return 1.0 in the case of nmd. When the status of depth observation is md and \(td({\varvec{x}})\) is constant nmd, we average \(p^t({\varvec{x}})\) over the neighboring frames because we cannot determine whether the depth value originates from a foreground object or from sudden depth noise; in this case, we set \(p^t({\varvec{x}})\) at frame t to 0.5. When the status of depth observation is nmd and \(td({\varvec{x}})\) is constant md, we also average \(p^t\) to obtain \(p^t_d({\varvec{x}})\).
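The per-pixel modification can be sketched as follows. The window handling at sequence borders and the fallback value when no measured neighbor exists are choices of this sketch, not details from the paper.

```python
import numpy as np

def modified_depth_likelihood(p, measured, trend, t, dt=2):
    # p: p^t at one pixel over all frames; measured: True where status is md.
    # Returns p^t_d at frame t following the cases described in Sect. 2.1.
    lo, hi = max(0, t - dt), min(len(p), t + dt + 1)
    if trend == "rippling nmd":
        # nmd is itself one of the two background values here
        return float(p[t]) if measured[t] else 1.0
    if trend == "constant nmd" and measured[t]:
        # sudden measurement: foreground or noise?  Substitute 0.5 at
        # frame t and average over the neighbouring frames.
        q = np.array(p[lo:hi], dtype=float)
        q[t - lo] = 0.5
        return float(q.mean())
    if trend == "constant md" and not measured[t]:
        # sudden dropout: average p^t over the neighbouring measured frames
        m = np.asarray(measured[lo:hi])
        vals = np.asarray(p[lo:hi])[m]
        return float(vals.mean()) if m.any() else 0.5
    return float(p[t])
```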
2.2 Background Subtraction Using Appearance
It is effective to use the appearance information to reduce false negatives caused by the depth camouflage problem. However, appearance-based background subtraction mainly suffers from illumination changes in indoor environments. Many researchers have proposed appearance-based background subtraction strategies that are robust to illumination changes [8]. In our method, we use two approaches to handle them.
First, we detect foreground objects using texture-based background subtraction to reduce false positives from global illumination changes. We use the scale invariant local ternary pattern (SILTP) [9] as the texture feature and the visual background extractor (ViBe) [10] as the background subtraction strategy. In the foreground detection of ViBe, we use the Hamming distance instead of the L2 distance. For the initialization of ViBe, we build an RGB background image \(B_{a}({\varvec{x}})\), using \(p^t_d({\varvec{x}})\) to exclude RGB values belonging to foreground objects, as follows:
where \(I^t_{a}\) is the RGB image at frame t.
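One way to realize this selection is sketched below: the RGB value at each pixel is taken from the frame where the depth background likelihood is largest. Taking the single argmax frame is an assumption of this sketch; a weighted average over high-likelihood frames would also fit the description.

```python
import numpy as np

def appearance_background(rgb_stack, p_d_stack):
    # rgb_stack: (N, H, W, 3) RGB frames; p_d_stack: (N, H, W) depth
    # background likelihoods.  B_a(x) is the RGB value observed when
    # p^t_d(x) is largest, so foreground appearances are excluded.
    best = p_d_stack.argmax(axis=0)          # best frame index per pixel
    H, W = best.shape
    yy, xx = np.mgrid[0:H, 0:W]
    return rgb_stack[best, yy, xx]           # (H, W, 3) background image
```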
Second, we perform RGB-based background subtraction around the pixels detected by the texture-based background subtraction, because the results of the texture-based background subtraction are mainly foreground boundaries, as shown in Fig. 1a. However, the final results may suffer from false positives from illumination changes if we uniformly apply RGB-based background subtraction to the neighboring pixels. We therefore select the neighboring pixels to which RGB-based background subtraction is applied based on the likelihood of the depth background \(p^t_d\). Figure 1b illustrates this decision. We compare \(p^t_d({\varvec{x}})\) at a pixel \({\varvec{x}}\) with \(p^t_d({\varvec{x}}+\delta )\) and \(p^t_d({\varvec{x}}-\delta )\) at the two neighboring pixels, as shown in Fig. 1b. We expect descending gradients of \(p^t_d\) at the boundaries between foreground objects and background regions. If \(p^t_d({\varvec{x}}) - p^t_d({\varvec{x}}+\delta )> th_g \wedge |p^t_d({\varvec{x}}) - p^t_d({\varvec{x}}-\delta )| < th_g\) or \(p^t_d({\varvec{x}}-\delta ) - p^t_d({\varvec{x}}) > th_g \wedge |p^t_d({\varvec{x}}) - p^t_d({\varvec{x}}+\delta )| < th_g\) is satisfied, we perform RGB-based background subtraction for the \(5 \times 5\) neighboring pixels at \({\varvec{x}}+\delta \). We also perform RGB-based background subtraction for \({\varvec{x}}+i\delta \) if \(|p^t_d({\varvec{x}}+i\delta ) - p^t_d({\varvec{x}}+\delta )| < th_g\), up to \(i = 4\), because regions where the likelihood of the depth background is similar to \(p^t_d({\varvec{x}}+\delta )\) are also likely to belong to foreground objects. In this study, we compare eight neighboring pixels, as shown in Fig. 1b.
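A 1-D simplification of this decision along one scan direction can be sketched as follows; the value of \(th_g\) and the restriction to a single direction (the first of the two symmetric conditions) are simplifications made for illustration.

```python
def select_rgb_pixels(p_row, th_g=0.3, max_i=4):
    # p_row holds p^t_d at x-delta, x, x+delta, x+2*delta, ... along one
    # direction.  A descending step from x to x+delta with a flat left
    # side marks a foreground boundary; pixels are then followed while
    # the likelihood stays within th_g of p(x+delta).  Returns the
    # indices into p_row where RGB-based subtraction would be applied.
    selected = []
    p_prev, p_x, p_next = p_row[0], p_row[1], p_row[2]
    if p_x - p_next > th_g and abs(p_x - p_prev) < th_g:
        selected.append(2)                     # index of x+delta
        for i in range(2, max_i + 1):          # follow x+i*delta
            if 1 + i >= len(p_row):
                break
            if abs(p_row[1 + i] - p_next) < th_g:
                selected.append(1 + i)
            else:
                break
    return selected
```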
We compute RGB-based background subtraction as follows:
where \(\sigma _{a}\) is a parameter that controls the similarity between \(B_{a}({\varvec{x}})\) and \(I^t_{a}({\varvec{x}})\), and \(ic({\varvec{x}})\) is the penalty term for illumination changes.
The texture-based approach is robust to global illumination changes; however, this approach often causes false positives in the case of local illumination changes such as shadows. Similar to [11], we perform the detection of illumination changes in the HSV color space.
where \(I_{H,S,V}\) and \(B_{H,S,V}\) denote the hue, saturation, and brightness of \(I^t_{a}\) and \(B_{a}\) in the HSV color space, respectively. Further, \(th_{H,S,V}\) are user-chosen thresholds. This condition has two parts. The first part tests the similarity of hue and saturation between \(I^t_{a}\) and \(B_{a}\) in regions with sufficient light. The second part tests the similarity of brightness between \(I^t_{a}\) and \(B_{a}\) in dark regions; there, we drop the condition on hue because hue is not stable in dark regions.
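This two-part test can be sketched as follows. The threshold values, the darkness cutoff, and the use of a circular hue distance are assumptions of this sketch; the paper only specifies the structure of the condition. Note that Python's `colorsys` expects RGB components in [0, 1].

```python
import colorsys

def is_illumination_change(rgb_I, rgb_B, th_H=0.1, th_S=0.1, th_V=0.2, dark=0.2):
    # HSV test in the spirit of Sect. 2.2 / [11]: in well-lit regions a
    # pixel is an illumination change if hue and saturation match the
    # background; in dark regions only brightness is compared, because
    # hue is unstable there.  All thresholds are assumed values.
    h1, s1, v1 = colorsys.rgb_to_hsv(*rgb_I)
    h2, s2, v2 = colorsys.rgb_to_hsv(*rgb_B)
    if v1 >= dark and v2 >= dark:
        dh = min(abs(h1 - h2), 1.0 - abs(h1 - h2))   # hue is circular
        return dh < th_H and abs(s1 - s2) < th_S
    return abs(v1 - v2) < th_V
```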
2.3 Combination of Depth and Appearance Information
We combine \(p^t_d({\varvec{x}})\) and \(p^t_a({\varvec{x}})\) using a graph-based approach for obtaining a foreground mask \(L^t\) at frame t. We define the energy function as follows:
where \(L^t({\varvec{x}}) = \{1.0 \equiv FG,0.0 \equiv BG\}\) and \(\xi \) is a set of connected pixel pairs in an eight-connected 2D grid graph; \(f(L^t({\varvec{x}}))\) evaluates the likelihood of the foreground and background; \(g(L^t({\varvec{x}}_i),L^t({\varvec{x}}_j))\) represents the relationship between neighboring pixels using the depth and the appearance information; \(\alpha \) is a parameter defined by the user. We minimize \(E(L^t)\) using graph cuts [7] in order to obtain the optimal \(L^t\).
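From this description, the energy has the standard unary-plus-pairwise form. Written out as a reconstruction consistent with the text (not copied from the paper):

```latex
E(L^t) = \sum_{{\varvec{x}}} f\bigl(L^t({\varvec{x}})\bigr)
       + \alpha \sum_{({\varvec{x}}_i,{\varvec{x}}_j) \in \xi}
         g\bigl(L^t({\varvec{x}}_i), L^t({\varvec{x}}_j)\bigr)
```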
We compute \(f(L^t({\varvec{x}}))\) as follows:
where \(\sigma _f\) emphasizes the likelihood of the foreground. Figure 2 shows \(p^t_d\), \(p^t_a\), and fg. In Fig. 2d, the likelihood that the person's arm belongs to the foreground is enhanced owing to \(p^t_a\).
We compute \(g(L^t({\varvec{x}}_i),L^t({\varvec{x}}_j))\) as follows:
where \(\sigma _{ds}\) and \(\sigma _{as}\) are parameters that control the similarity of the depth and RGB values. If neighboring pixels have similar depth and RGB values, the labels of the neighboring pixels are likely to be the same owing to the term g, which reduces false positives using the spatial similarity of the depth and appearance information.
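A contrast-sensitive pairwise term of this kind can be sketched as follows. The product of Gaussian similarities and the Potts-style gating on the labels are assumptions consistent with the description, not the paper's exact definition.

```python
import numpy as np

def pairwise_term(d_i, d_j, a_i, a_j, l_i, l_j, sigma_ds=10.0, sigma_as=10.0):
    # Penalize a label change between neighbours whose depth and RGB
    # values are similar; sigma_ds and sigma_as control the similarity
    # of the depth and RGB values, as in the text.
    if l_i == l_j:
        return 0.0                                   # no penalty: same label
    w_d = np.exp(-(d_i - d_j) ** 2 / (2 * sigma_ds ** 2))
    w_a = np.exp(-np.sum((np.asarray(a_i, float) - np.asarray(a_j, float)) ** 2)
                 / (2 * sigma_as ** 2))
    return float(w_d * w_a)                          # high when values similar
```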
3 Dataset and Parameter Settings
We used an open access dataset provided by SBM-RGBD 2017 [6]. The following seven video categories are available:

1. Illumination Changes (4 videos): caused by light switches or automatic camera brightness changes.

2. Color Camouflage (4 videos): foreground objects whose color is very close to that of the background.

3. Depth Camouflage (4 videos): foreground objects whose depth is very close to that of the background.

4. Intermittent Motion (6 videos): foreground objects that cause “ghosting” artifacts in the detected motion.

5. Out of Sensor Range (5 videos): non-measured depth regions resulting from objects too close to or too far from the sensor.

6. Shadows (5 videos): shadows cast by foreground objects.

7. Bootstrapping (5 videos): foreground objects present in all the frames.
We evaluated the performance of our method using seven metrics: recall, specificity, false positive rate (FPR), false negative rate (FNR), percentage of wrong classifications (PWC), precision, and FMeasure. We used the parameters described in Table 1. These parameters were fixed for all the scenes. For further details of the parameters for SILTP and ViBe, please refer to [9, 10].
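For reference, the seven metrics follow the standard confusion-matrix definitions; PWC is the percentage of wrongly classified pixels.

```python
def segmentation_metrics(TP, FP, TN, FN):
    # Standard definitions of the seven SBM-RGBD evaluation metrics
    # computed from the pixel-level confusion counts.
    recall = TP / (TP + FN)
    specificity = TN / (TN + FP)
    FPR = FP / (FP + TN)
    FNR = FN / (TP + FN)
    PWC = 100.0 * (FN + FP) / (TP + FN + FP + TN)
    precision = TP / (TP + FP)
    f_measure = 2 * precision * recall / (precision + recall)
    return dict(Recall=recall, Specificity=specificity, FPR=FPR,
                FNR=FNR, PWC=PWC, Precision=precision, FMeasure=f_measure)
```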
Our method was implemented in C++ and executed using a single thread. The performance measurements were carried out on a 3.70 GHz Intel Xeon processor with 32.0 GB of main memory. The average processing speed was 1.95 fps. Notably, our method is a batch algorithm.
4 Experimental Results
Table 2 presents the results of the quantitative evaluation. We confirmed that our method (SCAD) exhibits satisfactory performance in indoor environments. Notably, FMeasure, precision, and recall in Illumination Changes are lower because two videos in that category contain no foreground objects.
Figures 3 and 4 show examples of input images, results of SCAD, \(p^t_d\), and \(p^t_a\). We observed that SCAD detected foreground objects in Depth Camouflage and Color Camouflage, as shown in Fig. 3. In Fig. 3, the appearance-based background subtraction is not effective for colorCam2, as seen from \(p^t_a\), and the depth-based background subtraction is not effective for DCamSeq2, as seen from \(p^t_d\). However, SCAD reduces false negatives by combining \(p^t_d\) and \(p^t_a\). Moreover, SCAD suppressed false positives caused by noise in \(p^t_d\) by using graph cuts.
In the Bootstrapping category, we observed that SCAD missed parts of the foreground objects, as shown in Fig. 4. SCAD failed to build background images free of foreground objects because these objects remain at the same position in all the frames.
SCAD has a drawback arising from the combination of depth and appearance information. In the Illumination Changes category, we observed some false positives. They were caused by false detections of the texture-based background subtraction when the room became darker and the observed RGB image was noisy; the texture-based background subtraction does not function well there because it detects the noise as local changes. Moreover, SCAD classified weak shadows as the background, whereas strong shadows near the foreground object were detected as foreground, as shown in shadows1 of Fig. 4, because the texture-based background subtraction detects the boundaries of strong shadows. In these categories, the false positives in the final results of SCAD stem from errors of the texture-based background subtraction.
5 Conclusion
In this paper, we proposed a simple combination of appearance and depth information (SCAD) for foreground segmentation. We compute two likelihoods of the background using background subtraction based on the appearance and depth information. Subsequently, we optimize an energy function based on the two likelihoods using graph cuts to obtain foreground masks. We evaluated SCAD on the SBM-RGBD 2017 dataset and confirmed that it is effective for indoor environments. In future work, we will convert SCAD into an online method; we expect the conversion to be straightforward based on a sequential updating strategy.
References
Gordon, G., Darrell, T., Harville, M., Woodfill, J.: Background estimation and removal based on range and color. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 459–464. IEEE (1999)
Moyà-Alcover, G., Elgammal, A., Jaume-i Capó, A., Varona, J.: Modeling depth for nonparametric foreground segmentation using RGBD devices. Pattern Recognit. Lett. (2016)
Fernandez-Sanchez, E.J., Diaz, J., Ros, E.: Background subtraction based on color and depth using active sensors. Sensors 13(7), 8895–8915 (2013)
Schiller, I., Koch, R.: Improved video segmentation by adaptive combination of depth keying and mixture-of-gaussians. In: Heyden, A., Kahl, F. (eds.) SCIA 2011. LNCS, vol. 6688, pp. 59–68. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21227-7_6
Camplani, M., Salgado, L.: Background foreground segmentation with RGB-D kinect data: an efficient combination of classifiers. J. Vis. Commun. Image Represent. 25(1), 122–136 (2014)
Camplani, M., Maddalena, L., Moyà-Alcover, G., Petrosino, A., Salgado, L.: A benchmarking framework for background subtraction in RGBD videos. In: Battiato, S., Gallo, G., Farinella, G.M., Leo, M. (eds.) New Trends in Image Analysis and Processing - ICIAP 2017 Workshops. LNCS, vol. 10590. Springer (2017)
Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1124–1137 (2004)
Bouwmans, T.: Traditional and recent approaches in background modeling for foreground detection: an overview. Comput. Sci. Rev. 11, 31–66 (2014)
Liao, S., Zhao, G., Kellokumpu, V., Pietikäinen, M., Li, S.Z.: Modeling pixel process with scale invariant local patterns for background subtraction in complex scenes. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1301–1306. IEEE (2010)
Barnich, O., Van Droogenbroeck, M.: ViBe: a powerful random technique to estimate the background in video sequences. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2009, pp. 945–948. IEEE (2009)
Cucchiara, R., Grana, C., Neri, G., Piccardi, M., Prati, A.: The Sakbot system for moving object detection and tracking. In: Remagnino, P., Jones, G.A., Paragios, N., Regazzoni, C.S. (eds.) Video-Based Surveillance Systems, pp. 145–157. Springer, Boston (2002). https://doi.org/10.1007/978-1-4615-0913-4_12
Acknowledgment
This work was partially supported by JSPS KAKENHI Grant Numbers JP16J02614 and JP15K12066. We acknowledge the SBM-RGBD dataset web page http://rgbd2017.na.icar.cnr.it/SBM-RGBDdataset.html.
© 2017 Springer International Publishing AG
Minematsu, T., Shimada, A., Uchiyama, H., Taniguchi, Ri. (2017). Simple Combination of Appearance and Depth for Foreground Segmentation. In: Battiato, S., Farinella, G., Leo, M., Gallo, G. (eds) New Trends in Image Analysis and Processing – ICIAP 2017. ICIAP 2017. Lecture Notes in Computer Science(), vol 10590. Springer, Cham. https://doi.org/10.1007/978-3-319-70742-6_25