1 Introduction

Low-cost RGBD sensors are being successfully used in several indoor video surveillance applications. Many of these applications rely on a scene background model, learned from data, to detect moving objects that are then further processed and analyzed.

Background subtraction from color video data is a widely studied problem, as witnessed by several recent surveys [1, 6, 19, 21]. Main challenges include illumination changes (where the background model should adapt to both strong and mild illumination changes), color camouflage (where foreground objects whose color is very close to that of the background are hard to segment), shadows cast by foreground objects occluding the visible light, bootstrapping (where the background model should be properly set up even in the absence of a training set free of moving foreground objects), and the so-called intermittent motion, referring to videos known for causing “ghosting” artifacts in the detected motion, i.e., foreground objects that should be detected even if they stop moving (abandoned objects) or if they were initially stationary and then start moving (removed objects).

Depth data is particularly attractive for background subtraction, since it is not affected by illumination changes or color camouflage; thus, some background modeling approaches based only on depth have been proposed [9, 20]. However, depth data suffers from other types of problems, such as depth camouflage (where foreground objects whose depth is very close to that of the background are hard to segment) and out of sensor range (where the sensor produces invalid depth values for foreground or background objects that are too close to or too far from it). Moreover, depth data shares with color data other challenges, including intermittent motion, bootstrapping, and shadows cast by foreground objects occluding the IR light coming from the emitter.

Many recent methods try to exploit the complementary nature of the color and depth information acquired with RGBD sensors. Generally, these methods either extend to RGBD data well-known background models originally designed for color data [8, 12], or model the scene background (and sometimes also the foreground) based on color and depth independently and then combine the results on the basis of different criteria [5, 10, 13, 15].

The method proposed in this paper belongs to the latter class of methods. Two background models are constructed for color and depth information, exploiting a self-organizing neural background model previously adopted for RGB videos [18]. The resulting color and depth detection masks are then combined to achieve the final detection masks, also used to better guide the selective model update procedure.

2 RGBD-SOBS Algorithm

The proposed algorithm for background subtraction using RGBD video data exploits the background model constructed and maintained in the SC-SOBS algorithm [18], originally designed for RGB data. It is based on the idea of building a neural background model of the image sequence by learning, in a self-organizing manner, image sequence variations, seen as trajectories of pixels in time. Two separate models are constructed for color and depth data, and their resulting background subtraction masks are suitably combined in order to update the models and to achieve the final result. In the following, we provide a self-contained description of the color and depth models, referring to [18] for further details on the original neural model, and of the combination criterion.

2.1 The Color Model

Given the color image sequence \(\left\{ I_1, \ldots , I_T \right\} \), at each time instant t we build and update a neuronal map for each pixel \(\mathbf p \), consisting of \(n \!\times \! n\) weight vectors \(cm_t^{i,j}(\mathbf p ), i,j\,=\,0, \ldots , n-1\), which will be called the color model for pixel \(\mathbf p \) and will be indicated as \(CM_t(\mathbf p )\):

$$\begin{aligned} CM_t(\mathbf p ) = \left\{ cm_t^{i,j}(\mathbf p ), \; i,j =0, \ldots , n-1 \right\} . \end{aligned}$$
(1)

If every sequence image has size \(N\,\times \,P\), the complete set of models \(CM_t(\mathbf p )\) for all pixels \(\mathbf p \) of the t-th sequence image \(I_t\) is organized as a 2D neuronal map \(CB_t\) of size \((n \!\times \! N) \!\times \! (n \!\times \! P)\), where the weight vectors \(cm_t^{i,j}(\mathbf p )\) for the generic pixel \(\mathbf p =(x,y)\) are at neuronal map position \((n \!\times \! x + i, n \!\times \! y + j)\), \(i, j = 0, \ldots , n-1\):

$$\begin{aligned} CB_t(n \!\times \! x + i, n \!\times \! y + j) = cm_t^{i,j}(\mathbf p ), \; i,j =0, \ldots , n-1. \end{aligned}$$
(2)

Although redundant, both notations \(CM_t\) and \(CB_t\) introduced in Eqs. (1) and (2) will be adopted. Indeed, the color model \(CM_t(\mathbf p )\) will be used to indicate the whole set of color weight vectors for a single pixel \(\mathbf p \) at time t, helping to focus on the pixelwise representation of the background model. On the other hand, the neuronal map \(CB_t\) will be used to refer to the whole color background model of an image sequence at time t, highlighting the spatial relationships among the weight vectors of adjacent pixels (see Eq. (7)).

Differently from [18], for color model initialization, we construct a color image CE that is an estimate of the color scene background. Then, for each pixel \(\mathbf p \), the corresponding weight vectors of the color model \(CM_0(\mathbf p )\) are initialized with the pixel color value \(CE(\mathbf p )\):

$$\begin{aligned} cm_0^{i,j}(\mathbf p ) = CE(\mathbf p ), \; \; \; i,j=0, \ldots , n-1. \end{aligned}$$
(3)

Among the several state-of-the-art background estimation methods [2] for constructing CE, in the experiments we have chosen the LabGen algorithm [14], which is one of the best-performing methods on the SBMnet dataset. Specifically, LabGen was run over the first L color frames, where L = 100.
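As a purely illustrative sketch (not the authors' implementation), the initialization of Eqs. (2) and (3) can be written in a few lines of NumPy. The function name init_color_model, the use of NumPy, and the treatment of CE as an H x W x 3 array are our assumptions, and the default n = 3 is only a placeholder for the model size chosen in Table 1.

import numpy as np

def init_color_model(CE, n=3):
    # Sketch of Eq. (3): every pixel of the background estimate CE (H x W x 3)
    # is replicated into an n x n block of identical weight vectors, laid out
    # as in Eq. (2), yielding the neuronal map CB_0 of shape (n*H, n*W, 3).
    CE = np.asarray(CE, dtype=float)
    return np.kron(CE, np.ones((n, n, 1)))

Calling init_color_model(CE) once before the training phase yields \(CB_0\); the same replication scheme, applied to DE, would initialize the depth neuronal map \(DB_0\) of Sect. 2.2.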

At each time step t, color background subtraction is achieved by comparing each pixel \(\mathbf p \) of the t-th sequence frame \(I_t\) with the current pixel color model \(CM_{t-1}(\mathbf p )\), to determine the weight vector \(BM_t^C(\mathbf p )\) that best matches it:

$$\begin{aligned} d(BM_t^C(\mathbf p ), I_t(\mathbf p )) = \min _{i,j=0, \ldots , n-1} d(cm_{t-1}^{i,j}(\mathbf p ), I_t(\mathbf p )). \end{aligned}$$
(4)

For the experiments reported in Sect. 3, the metric \(d(\cdot ,\cdot )\) is chosen as the Euclidean distance in the HSV color hexcone as in [18]. The color background subtraction mask for pixel \(\mathbf p \) is then computed as

$$\begin{aligned} M^C_t(\mathbf p ) = \left\{ \begin{array}{lll} 1 &{} &{} \mathrm{if } \; \; \; NCF_t(\mathbf p ) \le 0.5\\ 0 &{} &{} \mathrm{otherwise}\\ \end{array}, \right. \end{aligned}$$
(5)

where the Neighborhood Coherence Factor is defined as \(NCF_t(\mathbf p )\!=\!|\varOmega _\mathbf p |/|N_\mathbf p |\) [7]. Here \(| \cdot |\) refers to the set cardinality, \(N_\mathbf p \!=\!\{ \mathbf q \!: |\mathbf p -\mathbf q | \le h \}\) is a 2D spatial neighborhood of \(\mathbf p \) having width (2h + 1) \(\in \mathbb {N}\) (in the experiments h = 2), and

$$\begin{aligned} \varOmega _\mathbf p = \{ \mathbf q \in N_\mathbf p \!: (d(BM_t^C(\mathbf q ),I_t(\mathbf p )) \le \varepsilon ^C) \vee (shadow(BM_t^C(\mathbf q ),I_t(\mathbf p ))) \}. \end{aligned}$$
(6)

\(\varOmega _\mathbf p \) is the set of pixels \(\mathbf q \) belonging to \(N_\mathbf p \) whose background model contains a best match that either is close enough to the incoming pixel value or classifies it as a shadow of the background. \(\varepsilon ^C\) is a color threshold enabling the distinction between foreground and background pixels, while \(shadow(\cdot )\) is a function implementing the shadow detection mechanism adopted in [16]. It has been shown that the introduction of spatial coherence enhances the robustness of the background subtraction algorithm against false detections [17].
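For illustration only, the detection of Eqs. (4)-(6) can be sketched in NumPy/SciPy as follows. The sketch simplifies the method in several ways that we state explicitly: it uses the Euclidean distance in RGB instead of the HSV color-hexcone distance of [18], it omits the shadow test of Eq. (6), and it approximates \(\varOmega _\mathbf p \) by counting neighbors whose own best-match distance is below \(\varepsilon ^C\). Function and variable names are ours; the defaults \(\varepsilon ^C\!=\!0.008\) and h = 2 are the values given in the text.

import numpy as np
from scipy.ndimage import uniform_filter

def color_mask(I_t, CB, n=3, eps_c=0.008, h=2):
    # Simplified sketch of Eqs. (4)-(6).  I_t: (H, W, 3) color frame in [0, 1];
    # CB: (n*H, n*W, 3) color neuronal map.
    H, W, _ = I_t.shape
    # View CB as (H, W, n, n, 3): the n x n weight vectors of each pixel.
    models = CB.reshape(H, n, W, n, 3).transpose(0, 2, 1, 3, 4)
    diff = models - I_t[:, :, None, None, :]
    best_dist = np.sqrt((diff ** 2).sum(-1)).min(axis=(2, 3))        # Eq. (4)
    # Neighborhood Coherence Factor: fraction of pixels in the (2h+1)x(2h+1)
    # window around p that match their background model (simplified Eq. (6)).
    ncf = uniform_filter((best_dist <= eps_c).astype(float), size=2 * h + 1)
    return (ncf <= 0.5).astype(np.uint8)                             # Eq. (5): 1 = foreground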

An update of the color neuronal map is performed in order to adapt the color background model to scene modifications. At each time step t, the weight vectors of \(CB_{t-1}\) in a neighborhood of the best matching weight vector \(BM_t^C(\mathbf p )\) are updated according to a weighted running average. In detail, if \(BM_t^C(\mathbf p )\) is found at position \(\overline{\mathbf{p }}\) in \(CB_{t-1}\), then the weight vectors of \(CB_{t-1}\) are updated according to

$$\begin{aligned} CB_{t}(\mathbf q ) = (1-\alpha ^C_t(\mathbf p )) CB_{t-1}(\mathbf q ) + \alpha ^C_t(\mathbf p ) I_t(\mathbf p ) \; \; \; \forall \mathbf q \in N_{\overline{\mathbf{p }}}, \end{aligned}$$
(7)

where \(N_{\overline{\mathbf{p }}}=\left\{ \mathbf q : \left| \overline{\mathbf{p }} - \mathbf q \right| \le k \right\} \) is a 2D spatial neighborhood of \(\overline{\mathbf{p }}\) having width (2\(k+\) 1) \(\in \mathbb {N}\) (in the reported experiments k = 1). Moreover,

$$\begin{aligned} \alpha ^C_t(\mathbf p ) = \gamma \cdot G(\mathbf q -\overline{\mathbf{p }}) \cdot \left( 1 - M_t(\mathbf p ) \right) , \end{aligned}$$
(8)

where \(\gamma \) is the learning rate and \(G(\cdot ) = \mathcal{N}(\cdot ; \mathbf 0 , \sigma ^2 I)\) is a 2D Gaussian low-pass filter with zero mean and covariance \(\sigma ^2 I\) (in the reported experiments \(\sigma ^2\) = 0.75). The \(\alpha ^C_t(\mathbf p )\) values in Eq. (8) are weights that smoothly take into account the spatial relationship between the current pixel \(\mathbf p \) (through its best matching weight vector found at position \(\overline{\mathbf{p }}\)) and its neighboring pixels in \(I_t\) (through the weight vectors at positions \(\mathbf q \in N_{\overline{\mathbf{p }}}\)), thus preserving the topological properties of the input in the neural network update (close inputs correspond to close outputs). In [18], \(M_t(\mathbf p )\) is the background subtraction mask value \(M^C_t(\mathbf p )\) for pixel \(\mathbf p \), computed as in Eq. (5).
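A minimal per-pixel sketch of the update of Eqs. (7)-(8), under the assumptions that the best-match position \(\overline{\mathbf{p }}\) has already been located in the neuronal map, that the learning rate \(\gamma \) has already been scaled as discussed below, and that an unnormalized Gaussian kernel is used (a normalizing constant would only rescale \(\gamma \)); names are ours, not the authors':

import numpy as np

def update_color_model(CB, I_tp, p_bar, M_tp, gamma, k=1, sigma2=0.75):
    # Sketch of Eqs. (7)-(8) for a single pixel p.  CB is the (n*N, n*P, 3)
    # neuronal map (updated in place), I_tp the color of p, p_bar the position
    # of the best-matching weight vector BM_t^C(p) in CB, and M_tp the mask
    # value used for the selective update (0 = background, 1 = foreground).
    if M_tp != 0:        # factor (1 - M_t(p)) in Eq. (8): no update on foreground
        return
    I_tp = np.asarray(I_tp, dtype=float)
    x_bar, y_bar = p_bar
    for dx in range(-k, k + 1):
        for dy in range(-k, k + 1):
            qx, qy = x_bar + dx, y_bar + dy
            if 0 <= qx < CB.shape[0] and 0 <= qy < CB.shape[1]:
                g = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma2))    # unnormalized Gaussian weight
                alpha = gamma * g                                    # Eq. (8)
                CB[qx, qy] = (1 - alpha) * CB[qx, qy] + alpha * I_tp # Eq. (7)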

In the usual case where a set of K initial sequence frames is available for training, the initialization and update procedures described above are applied to the first K sequence frames to train the neural network background model, which is then used for detection and update in all subsequent frames. What differentiates the training and the online phases in the proposed algorithm is the background subtraction mask \(M_t(\mathbf p )\) adopted in Eq. (8), besides the choice of parameters in Eqs. (6) and (8). Indeed, during the online phase, \(M_t(\mathbf p )\) is the combined mask value for pixel \(\mathbf p \) (see Sect. 2.3):

$$\begin{aligned} M_t(\mathbf p ) = \left\{ \begin{array}{lll} M^C_t(\mathbf p ) &{} &{} \mathrm{if } \; \; \; 1 \le t \le K\\ M^{Comb}_t(\mathbf p ) &{} &{} \mathrm{if } \; \; \; t > K\\ \end{array} \right. , \end{aligned}$$
(9)

in order to exploit depth information for the update of the color background model. The threshold \(\varepsilon ^C\) in Eq. (6) is chosen as \(\varepsilon ^C\!=\!\varepsilon ^C_1\) during training and \(\varepsilon ^C\!=\!\varepsilon ^C_2\) during the online phase, with \(\varepsilon ^C_2 \!\le \! \varepsilon ^C_1\), in order to include several observed pixel color variations during training and to obtain a more accurate color background model during the online phase (in the experiments, \(\varepsilon ^C_1\!=\!0.1\) and \(\varepsilon ^C_2\!=\!0.008\)). The learning rate \(\gamma \) in Eq. (8) is set as \(\gamma \!=\!\gamma _1 - t (\gamma _1-\gamma _2)/K\) during training and as \(\gamma \!=\!\gamma _2\) during the online phase, where \(\gamma _1\) and \(\gamma _2\) are predefined constants such that \(\gamma _2\!\le \!\gamma _1\), in order to ensure neural network convergence during the training phase and to adapt to scene variability during the online phase. In order for the \(\alpha ^C_t(\mathbf p )\) values in Eq. (7) to belong to [0,1], we set \(\gamma _1\!=\!c_1/\max \limits _{{\displaystyle \mathbf q \in N_{\overline{\mathbf{p }}}}} G(\mathbf q -\overline{\mathbf{p }})\) and \(\gamma _2\!=\!c_2/\max \limits _{{\displaystyle \mathbf q \in N_{\overline{\mathbf{p }}}}} G(\mathbf q -\overline{\mathbf{p }})\), with \(c_1\) and \(c_2\) constants such that \(0\!\le \!c_2\!\le \!c_1\!\le \) 1 (in the experiments, \(c_1\!=\!0.1\) and \(c_2\!=\!0.05\)). For a deeper explanation of the mathematical grounds behind the choice of the color model parameters, the interested reader is referred to [18].
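The parameter schedules described above can be summarized by the following sketch (hypothetical helpers; \(\gamma _1\) and \(\gamma _2\) are assumed already scaled by the factors \(c_1/\max G\) and \(c_2/\max G\) given in the text):

def learning_rate(t, K, gamma1, gamma2):
    # Learning-rate schedule: linear decay from gamma1 to gamma2 over the K
    # training frames, then constant gamma2 during the online phase.
    if t <= K:
        return gamma1 - t * (gamma1 - gamma2) / K
    return gamma2

def color_threshold(t, K, eps1=0.1, eps2=0.008):
    # Threshold of Eq. (6): looser during training, tighter during the online phase.
    return eps1 if t <= K else eps2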

2.2 The Depth Model

The neural model adopted for depth information is analogous to the one adopted for color information. The differences are mainly due to the special treatment of the invalid values that are inherent in the depth acquisition process.

Given the depth image sequence \(\left\{ D_1, \ldots , D_T \right\} \), at each time instant t we build and update a depth neuronal map for each pixel \(\mathbf p \). It consists of \(n\,\times \,n\) weight vectors \(dm_t^{i,j}(\mathbf p ), i,j\) = 0, ..., n - 1, which will be called the depth model for pixel \(\mathbf p \) and will be indicated as \(DM_t(\mathbf p )\):

$$\begin{aligned} DM_t(\mathbf p ) = \left\{ dm_t^{i,j}(\mathbf p ), \; i,j =0, \ldots , n-1 \right\} . \end{aligned}$$
(10)

Analogously to the case of the color model, the complete set of models \(DM_t(\mathbf p )\) for all pixels \(\mathbf p \) of the t-th depth frame \(D_t\) is organized as a 2D neuronal map \(DB_t\) of size \((n\,\times \,N)\,\times \,(n\,\times \,P)\).

For depth model initialization, an estimate DE of the depth scene background is constructed based on the observation that the scene background is generally farther from the camera than the foreground. Therefore, DE is obtained by retaining, for each pixel, the highest depth value observed in the first L depth frames. Then, for each pixel \(\mathbf p \), the corresponding weight vectors of the depth model \(DM_0(\mathbf p )\) are initialized with the pixel depth value \(DE(\mathbf p )\):

$$\begin{aligned} dm_0^{i,j}(\mathbf p ) = DE(\mathbf p ), \; \; \; i,j=0, \ldots , n-1. \end{aligned}$$
(11)
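A minimal sketch of the construction of DE, under the assumption that invalid depth readings are coded as 0 (the actual coding is sensor-dependent and not specified here); names are ours:

import numpy as np

def estimate_depth_background(depth_frames, invalid=0):
    # Sketch of the depth background estimate DE: per-pixel maximum of the
    # valid depth values observed in the first L frames (the background is
    # assumed farther from the sensor than the foreground).
    # depth_frames: (L, H, W) array.
    D = np.array(depth_frames, dtype=float)     # copy, so the input is not modified
    D[D == invalid] = -np.inf                   # exclude invalid readings from the max
    DE = D.max(axis=0)                          # per-pixel maximum over the L frames
    DE[np.isneginf(DE)] = invalid               # pixels never observed valid stay invalid
    return DE

The resulting DE can then be replicated into \(DB_0\) exactly as done for the color model in Eq. (3).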

At each time step t, depth background subtraction is achieved by comparing each pixel \(\mathbf p \) of the t-th depth frame \(D_t\) having a valid value with the current pixel depth model \(DM_{t-1}(\mathbf p )\), to determine the closest weight vector \(BM_t^D(\mathbf p )\):

$$\begin{aligned} |BM_t^D(\mathbf p ) - D_t(\mathbf p )| = \min _{i,j=0, \ldots , n-1} | dm_{t-1}^{i,j}(\mathbf p ) - D_t(\mathbf p ) |. \end{aligned}$$
(12)

The depth background subtraction mask for pixel \(\mathbf p \) is then computed as

$$\begin{aligned} M^D_t(\mathbf p ) = \left\{ \begin{array}{lll} 2 &{} &{} \mathrm{if} \; (D_t(\mathbf p ) \; invalid)\\ 0 &{} &{} \mathrm{if} \; (D_t(\mathbf p ) \; valid) \wedge (BM_t^D(\mathbf p ) - D_t(\mathbf p ) \le \varepsilon ^D)\\ 1 &{} &{} \mathrm{otherwise}\\ \end{array}, \right. \end{aligned}$$
(13)

where \(\wedge \) denotes the logical AND operator and \(\varepsilon ^D\) is a predefined threshold. In the experiments, depth values are normalized in [0,1] and \(\varepsilon ^D\) is chosen as \(\varepsilon ^D_1\!=\!0.1\) during training and, in the online phase, as \(\varepsilon ^D_2\!=\!0.00075\) for 16-bit depth images and \(\varepsilon ^D_2\!=\!0.005\) for 8-bit depth images. According to Eq. (13), incoming pixels having an invalid depth value are signaled in the depth detection mask (being assigned the value 2), so as to be suitably treated in the mask combination step (see Sect. 2.3). Moreover, all pixels whose depth value is greater than all the weight vectors of their depth model are considered background pixels (being assigned the value 0). This is in line with the observation, already exploited in the depth model initialization step, that the scene background is generally farther from the camera than the foreground.
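The depth detection of Eqs. (12)-(13) may be sketched as follows; again, the coding of invalid readings as 0 and the function and variable names are our assumptions, and the default threshold is the 8-bit online value given above:

import numpy as np

def depth_mask(D_t, DB, n=3, eps_d=0.005, invalid=0):
    # Sketch of Eqs. (12)-(13).  D_t: (H, W) depth frame normalized to [0, 1];
    # DB: (n*H, n*W) depth neuronal map.  Returns 0 (background), 1 (foreground)
    # or 2 (invalid depth).
    H, W = D_t.shape
    models = DB.reshape(H, n, W, n).transpose(0, 2, 1, 3)        # (H, W, n, n)
    signed = (models - D_t[:, :, None, None]).reshape(H, W, -1)  # candidate BM - D_t values
    idx = np.abs(signed).argmin(axis=2)                          # closest weight vector, Eq. (12)
    best_signed = np.take_along_axis(signed, idx[..., None], axis=2)[..., 0]
    mask = np.ones((H, W), dtype=np.uint8)                       # default: foreground
    mask[best_signed <= eps_d] = 0                               # background, incl. pixels farther than the model
    mask[D_t == invalid] = 2                                     # invalid readings flagged for Sect. 2.3
    return mask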

Depth neuronal map update is also performed in order to adapt the depth background model to scene modifications. At each time step t and for each pixel \(\mathbf p \) having valid depth value \(D_t(\mathbf p )\), the weight vectors of \(DB_{t-1}\) in a neighborhood of a valid best matching weight vector \(BM_t^D(\mathbf p )\), found at position \(\overline{\mathbf{p }}\) in \(DB_{t-1}\), are updated according to

$$\begin{aligned} DB_{t}(\mathbf q ) = (1-\alpha ^D_t(\mathbf p )) DB_{t-1}(\mathbf q ) + \alpha ^D_t(\mathbf p ) D_t(\mathbf p ) \; \; \;\forall \mathbf q \in N_{\overline{\mathbf{p }}}, \end{aligned}$$
(14)

where \(\alpha ^D_t(\mathbf p ) = \gamma \cdot G(\mathbf q -\overline{\mathbf{p }}) \cdot \left( 1 - M^D_t(\mathbf p ) \right) ,\) and the remaining notation is defined as in Eqs. (7) and (8).

Moreover, during training, valid depth values for pixels that had invalid values in previous frames are included in the depth model. Specifically, weight vectors for a generic pixel \(\mathbf p =(x,y)\) that are still invalid at time t, \(1 \le t \le K\), are initialized with the valid depth value \(D_t(\mathbf p )\):

$$\begin{aligned} DB_t(n \!\times \! x + i, n \!\times \! y + j) = dm_t^{i,j}(\mathbf p ) = D_t(\mathbf p ), \; \; \; i,j=0, \ldots , n-1, 1 \le t \le K. \end{aligned}$$
(15)

This leads to learning a more complete depth background model during training. The process is not applied during the online phase, in order to avoid including in the depth model new valid values that might belong to foreground objects.
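A sketch of this backfilling step (Eq. (15)); here we assume that the depth neuronal map DB stores the invalid code (0 in this sketch) for never-initialized weight vectors, and that a pixel's model is either entirely valid or entirely invalid:

import numpy as np

def backfill_depth_model(DB, D_t, n=3, invalid=0):
    # Sketch of Eq. (15), applied only during training: pixels whose depth model
    # is still entirely invalid but whose current reading D_t(p) is valid get all
    # their n x n weight vectors set to D_t(p).  DB is modified in place.
    H, W = D_t.shape
    models = DB.reshape(H, n, W, n)                        # view sharing memory with DB
    never_seen = (models == invalid).all(axis=(1, 3))      # (H, W): model never initialized
    xs, ys = np.nonzero(never_seen & (D_t != invalid))
    models[xs, :, ys, :] = D_t[xs, ys][:, None, None]      # fill the whole n x n block of each such pixel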

2.3 Combining Color and Depth Masks

During online learning, the color mask \(M^C_t\) and the depth mask \(M^D_t\) are combined in order to produce a combined mask \(M^{Comb}_t\), which is adopted to selectively update the color model (see Eq. (9)). In case of invalid depth values (signaled by \(M^D_t(\mathbf p )\!=\!2\) in Eq. (13)), only color mask values are considered; otherwise, depth mask values are considered. Moreover, in order to reduce the adverse effect of noisy depth values around the object contours, contour pixels are signaled by setting \(M^D_t(\mathbf p )\!=\!3\) and color mask values are considered instead in these areas. Thus, the combined mask is computed as

$$\begin{aligned} M^{Comb}_t(\mathbf p ) = \left\{ \begin{array}{lll} M^C_t(\mathbf p ) &{} &{} \mathrm{if } \; \; \; M^D_t(\mathbf p )>1\\ M^D_t(\mathbf p ) &{} &{} \mathrm{otherwise}\\ \end{array} \right. . \end{aligned}$$
(16)
Fig. 1. Sequence genseq2 of the SBM-RGBD dataset: (a) color and depth images \(I_1\) and \(D_1\); (b) color and depth background estimates CE and DE; (c) color and depth images \(I_{159}\) and \(D_{159}\); (d) color and depth detection masks \(M^C_{159}\) and \(M^D_{159}\); (e) combined mask \(M^{Comb}_{159}\) and ground truth mask \(GT_{159}\). (Color figure online)

An example is provided in Fig. 1. Similarly to [10], the object contours are obtained as \(dil(\overline{M}^D_t) \!\wedge \! M^D_t\), where \(dil(\cdot )\) is the morphological dilation operator with a \(3\!\times \!3\) structuring element and \(\overline{M}^D_t\) denotes the complement of \(M^D_t\).
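The combination rule of Eq. (16), together with the contour flagging described above, can be sketched as follows; treating the contour as the foreground pixels of \(M^D_t\) that touch non-foreground pixels is our reading of \(dil(\overline{M}^D_t) \!\wedge \! M^D_t\), and the function names are ours:

import numpy as np
from scipy.ndimage import binary_dilation

def combine_masks(M_c, M_d):
    # Sketch of Sect. 2.3: flag noisy depth contours with the value 3, then use
    # the color mask wherever the depth mask is unreliable (values 2 or 3) and
    # the depth mask elsewhere (Eq. (16)).
    M_d = M_d.copy()
    fg = (M_d == 1)
    # dil(complement of M_d) AND M_d, as in [10]: depth-foreground pixels
    # adjacent to non-foreground pixels (3 x 3 structuring element).
    contours = binary_dilation(~fg, structure=np.ones((3, 3), dtype=bool)) & fg
    M_d[contours] = 3
    return np.where(M_d > 1, M_c, M_d).astype(np.uint8)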

2.4 The Algorithm

The above-described per-pixel procedure is sketched as the RGBD-SOBS algorithm reported in Fig. 2.

Fig. 2. RGBD-SOBS algorithm for pixel \(\mathbf p \).

3 Experimental Results

Experiments have been carried out on the SBM-RGBD dataset [3, 4], consisting of 33 RGBD videos acquired with the Microsoft Kinect and spanning 7 categories that cover diverse scene background modelling challenges (see Sect. 1): Illumination Changes (IC), Color Camouflage (CC), Depth Camouflage (DC), Intermittent Motion (IM), Out of sensor Range (OR), Shadows (Sh), and Bootstrapping (Bo).

Table 1. Parameter values adopted for evaluating the RGBD-SOBS algorithm.

The parameter values for the RGBD-SOBS algorithm common to all the SBM-RGBD videos are summarized in Table 1. In practice, all the default SC-SOBS parameter values [18], well established on the CDnet dataset [11], have been chosen for the color model, while an analysis analogous to the one reported in [16] has been carried out to choose the initialization and depth model parameters.

Accuracy is evaluated in terms of seven well-known metrics [3]: Recall (Rec), Specificity (Sp), False Positive Rate (FPR), False Negative Rate (FNR), Percentage of Wrong Classifications (PWC), Precision (Prec), and F-Measure (\(F_1\)).
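For completeness, the seven metrics can be computed from the per-video counts of true/false positives and negatives as in the following sketch (the standard definitions commonly associated with these names; the helper name is ours):

def metrics(TP, FP, TN, FN):
    # Standard definitions of the seven metrics from per-video confusion counts.
    rec = TP / (TP + FN)                               # Recall
    sp = TN / (TN + FP)                                # Specificity
    fpr = FP / (FP + TN)                               # False Positive Rate
    fnr = FN / (TP + FN)                               # False Negative Rate
    pwc = 100.0 * (FP + FN) / (TP + FP + TN + FN)      # Percentage of Wrong Classifications
    prec = TP / (TP + FP)                              # Precision
    f1 = 2 * prec * rec / (prec + rec)                 # F-Measure
    return rec, sp, fpr, fnr, pwc, prec, f1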

Table 2. Average performance results of RGBD-SOBS and RGB-SOBS in each category of the SBM-RGBD dataset. In boldface the best results for each metric.

Average performance metrics achieved by the proposed RGBD-SOBS algorithm are reported in Table 2 for all the categories of the SBM-RGBD dataset, showing that, on average, RGBD-SOBS performs quite well. Moreover, comparisons with the RGB-SOBS results (obtained by RGBD-SOBS using only color information) clearly show that the exploitation of depth information helps in achieving much higher performance. This is particularly true for disambiguating color camouflage, as exemplified in the first two rows of Fig. 3. Great improvement is also achieved for the Intermittent Motion category, where depth information can be easily exploited to detect both abandoned and removed objects and, thus, can help drive the update of the color model in a much more consistent way (see Fig. 3, third and fourth rows). Indeed, since selectivity prevents the model update in foreground areas, a correct classification of the removed foreground (e.g., the box that was originally on the floor, as shown in Fig. 4) is essential for achieving an accurate background model and, consequently, accurate detection results.

Fig. 3. Sequences colorCam1 (CC, 1st and 2nd rows) and abandoned1 (IM, 3rd and 4th rows): (a) color images; (b) depth images; (c) ground truth masks; masks computed by (d) RGBD-SOBS and (e) RGB-SOBS. (Color figure online)

Fig. 4. Sequence abandoned1: detail of the color background models for RGBD-SOBS at frames (a) 161 and (b) 185, and for RGB-SOBS at frames (c) 161 and (d) 185. (Color figure online)

Fig. 5. Sequences BootStrapping_ds (Bo, 1st row), shadows2 (Sh, 2nd row), ChairBox (IC, 3rd row), and DCamSeq2 (DC, 4th row): (a) color images; (b) depth images; (c) ground truth masks; masks computed by (d) RGBD-SOBS and (e) RGB-SOBS. (Color figure online)

Other well-known challenges, such as bootstrapping, illumination changes, and shadows, are handled fairly well also by RGB-SOBS (see Fig. 5). It should be observed that, although the performance results are on average comparable for the Depth Camouflage category, the strategy for combining color and depth masks (see Sect. 2.3) can sometimes lead to results worse than those obtained using only color information, as shown in the last row of Fig. 5, suggesting that further work is still needed to accurately handle this challenge.

4 Conclusions and Perspectives

The paper proposes the RGBD-SOBS algorithm for detecting moving objects in RGBD video sequences. Two background models are constructed for color and depth information, exploiting a self-organizing neural background model previously adopted for RGB videos. The resulting color and depth detection masks are combined, not only to achieve the final results, but also to better guide the selective model update procedure. The evaluation of the algorithm on the SBM-RGBD dataset shows that the exploitation of depth information helps in achieving much higher performance than just using color. This is true not only for sequences showing color camouflage, but also for those including many other color and depth background maintenance challenges (e.g., intermittent motion, bootstrapping, and out of sensor range data). Further work will be devoted to specifically handling the depth camouflage challenge, for which only fair results are achieved by the proposed method.