
1 Introduction

In the computer vision community, background modeling has long been a field of great interest, since it is a key prerequisite for many smart applications, ranging from active video surveillance to optical motion capture. Due to the many dynamic factors of a scene, background modeling is a very complex task. For example, creating the model when foreground elements are already within the scene (i.e., bootstrapping), managing natural light changes over time, and handling the movement of parts of the scenery (i.e., dynamic background) are all critical aspects in defining a robust background model of uncontrolled environments.

Over the years, different solutions have been proposed both for gradual light changes, such as the adaptive model based on self-organizing feature maps (SOFMs) presented in [22], and for the bootstrapping problem, by using keypoint clustering methods [5,6,7]. The last cited works introduced a new background modeling technique, based on the most interesting points of an image, able to efficiently manage video sequences acquired by moving cameras. Typically, keypoint-based methods are structured as follows: first, the background model is defined using keypoints and related descriptors detected by a feature extractor; then, the keypoints of the model are tracked across consecutive frames, through a feature matching phase, to tell apart background keypoints from foreground ones; finally, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is applied to extract foreground keypoint clusters, which represent the foreground element regions inside the scene. Despite their good results, the aforementioned works cannot handle scenarios containing dynamic elements (e.g., trees blowing in the wind or gushing fountains), because the common keypoint descriptors are too sensitive to the variations that can occur in an uncontrolled environment.

This paper, extending the pipeline of the adaptive bootstrapping management (ABM) method proposed in [5], introduces an innovative strategy to improve the accuracy of the feature matching phase in keypoint-based methods for dynamic background modeling. Once the keypoints are extracted with the A-KAZE algorithm [1], descriptors are computed by combining RootSIFT [2] with an average pooling applied to different patches obtained from a neighbourhood of each keypoint. Compared to renowned descriptors, such as A-KAZE, this combination is invariant to the small local changes caused by a moving background. Moreover, in the feature matching step, the Hamming distance is replaced by the Bhattacharyya distance, since the former is applicable only to binary descriptors (e.g., A-KAZE, ORB [26], and others), while the latter is more suitable to measure the closeness between our new descriptors. Experiments on the Scene Background Modeling.NET (SBMnet) dataset [18] have shown that the proposed model provides good performance, in terms of background reconstruction, on very challenging video sequences. Finally, by including a foreground detection module inspired by the work proposed in [7], additional tests on the Freiburg-Berkeley Motion Segmentation (FBMS-59) dataset [24] have also shown that our solution can be integrated into a moving object detection pipeline, obtaining excellent results.

The rest of the paper is organized as follows. In Sect. 2, a brief overview of the state of the art in background modeling is presented. In Sect. 3, the proposed method is described in detail, presenting both our descriptor for background modeling and the distance metric used in the feature matching phase. Section 4 contains the experimental results on the SBMnet and FBMS-59 datasets. Finally, Sect. 5 concludes the paper.

2 Related Work

In recent years, background modeling has been extensively studied, often as a prerequisite for other tasks, such as foreground detection or image segmentation. A valid example is reported in [31], where a spatio-temporal tensor formulation and a Gaussian Mixture Background Model are fused into a hybrid moving object detection system. In [21], a hierarchical background model for video surveillance using a PTZ camera is presented, obtained by separating the range of continuous focal lengths of the camera into several discrete levels and partitioning each level into many partial fixed scenes. Each new frame acquired by the PTZ camera is then related to a specific scene using a fast approximate nearest neighbour search. Another interesting model is used to implement the change detection method proposed in [29], where a pixel representation is characterized in a non-parametric paradigm by improved spatio-temporal binary similarity descriptors. This model is later used in [10] to guide the training of a Convolutional Neural Network (CNN) performing foreground segmentation. In [27], the background model is built using appearance features, obtained by considering each pixel neighbourhood, and the foreground is separated from the background by classifying each pixel using its associated feature vectors and a Support Vector Machine (SVM). Two different models are proposed in [12] to segment background and foreground dynamics, respectively: the first is based on Mixture of Generalized Gaussian (MoGG) distributions, and the second combines multi-scale correlation analysis with histogram matching. In [15], a pre-trained CNN model is used to create the background model, extracting features from a clean background image without moving objects. Foreground detection is then performed by comparing the features of this model with the features extracted from each frame of the video sequence using the same CNN. In [32], a dual-target non-parametric background model, able to work with different scenarios and simultaneously distinguish background and foreground pixels, is introduced. Moreover, a novel classification rule is presented: for background pixels, the method controls the updating of neighbouring pixels to obtain a complete silhouette of static or low-speed moving objects; for foreground pixels, it controls the updating of the current pixel to decrease false detections caused by improper background initialization or frequent background changes. Finally, an innovative solution is presented in [7], where a model based on keypoint clustering is used with spatio-temporal tracking to identify the candidate foreground regions that contain moving objects. A change detection step is then locally applied to the candidate foreground regions to obtain the moving object silhouettes. Our solution can be seamlessly integrated within the pipeline proposed in this last work, replacing the original model and reducing false positives in the foreground detection stage.

3 Proposed Method

In this section, the proposed solution for dynamic background modeling, which extends the pipeline of the ABM method, is described.

Fig. 1. An overview of the proposed solution used to improve the ABM method.

3.1 ABM Method

The original ABM method, taking a video sequence as input, is mainly composed of the following steps:

  • using the A-KAZE feature extractor, a keypoint set \(K_{b_{t}}\) and a descriptor set \(D_{b_{t}}\) are computed on the first frame \(f_{0}\);

  • the background model is initialized using \(K_{b_t}\) and \(D_{b_t}\);

  • for each frame \(f_t\) of the video sequence, with \(t>0\):

    • keypoints \(K_{t}\) and descriptors \(D_{t}\) are extracted from \(f_t\);

    • the candidate foreground keypoint set \(K_{F_t}\) and the background keypoint set \(K_{B_t}\) are estimated using a feature matching operation between \(D_{b_t}\) and \(D_{t}\), based on the Hamming distance;

    • the set \(C_t\) of clusters, representing the foreground element regions, is computed using DBSCAN and \(K_{F_t}\);

    • the background model is updated using the keypoints in \(K_{B_t}\) and their descriptors.

The choice of the A-KAZE feature extraction method derives both from its good performance in terms of speed and accuracy [1] and from the good results it has obtained in several other fields, including object detection [4, 8, 11] and mosaicking [3, 9, 30]. The DBSCAN algorithm is preferred over other standard clustering algorithms, such as K-Means, because it does not require a priori knowledge of the number of clusters.
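For clarity, a minimal sketch of this loop, assuming OpenCV and scikit-learn are available, is given below. The matching threshold and the DBSCAN parameters are illustrative placeholders, not the values used by the original ABM method, and the background model update is omitted.

```python
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

MATCH_THR = 40  # illustrative Hamming-distance threshold (assumption)

def abm_pipeline(frames):
    """Sketch of the original ABM loop with A-KAZE binary descriptors."""
    akaze = cv2.AKAZE_create()
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

    # Background model initialization on the first frame f_0.
    kp_b, des_b = akaze.detectAndCompute(frames[0], None)

    for f_t in frames[1:]:
        kp_t, des_t = akaze.detectAndCompute(f_t, None)

        # Feature matching: keypoints close to the model are background,
        # the remaining ones are candidate foreground keypoints K_{F_t}.
        matches = matcher.match(des_t, des_b)
        bg_idx = {m.queryIdx for m in matches if m.distance < MATCH_THR}
        fg_pts = np.array([kp.pt for i, kp in enumerate(kp_t)
                           if i not in bg_idx])

        # DBSCAN clustering of K_{F_t} (eps/min_samples are illustrative).
        clusters = None
        if len(fg_pts) > 0:
            clusters = DBSCAN(eps=30, min_samples=5).fit_predict(fg_pts)
        yield fg_pts, clusters
```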

3.2 Proposed Descriptor

In our solution, a new descriptor is proposed, used both in the background model initialization and in the feature matching phase, instead of relying on the A-KAZE descriptor as in the ABM method (although A-KAZE is still used to extract the keypoints). This new descriptor is obtained by combining RootSIFT and average pooling operations, as shown in Fig. 1. For each keypoint \(k_t\) extracted in the frame \(f_t\) (or \(k_b\), if we consider the background model keypoints), RootSIFT is used to extract the descriptor vector \(d_{k_t}\). RootSIFT extends SIFT descriptors by applying an L1-normalization and then taking the square root of each element of the SIFT vector. Afterwards, a neighbourhood \(\varPsi _{k_t}\) of size \(8 \times 8\) pixels is taken around each keypoint \(k_t\), treating pixels outside the image as zero when the keypoint lies at the border. Two average pooling operations are applied to \(\varPsi _{k_t}\): the first uses a filter of size \(2\times 2\), whereas the second uses a size of \(3\times 3\). Both operations use a stride of 1, and their results are scaled using the L1-normalization. The outputs are two feature vectors, called \(\alpha ^{'}_{k_t}\) and \(\alpha ^{''}_{k_t}\), respectively. Finally, the vectors \(d_{k_t}\), \(\alpha ^{'}_{k_t}\), and \(\alpha ^{''}_{k_t}\) are concatenated into a single descriptor, called \(\hat{d}_{k_t}\) (or \(\hat{d}_{b_{t-1}}\) for \(k_b \in K_{b_{t-1}}\)).
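A minimal sketch of this descriptor construction, assuming OpenCV and NumPy, follows; the exact alignment of the \(8 \times 8\) neighbourhood around the keypoint and the small normalization constants are our assumptions.

```python
import cv2
import numpy as np

def avg_pool(patch, k):
    """Average pooling with a k x k filter and stride 1, flattened."""
    n = patch.shape[0] - k + 1
    out = np.empty((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            out[i, j] = patch[i:i + k, j:j + k].mean()
    return out.ravel()

def proposed_descriptor(gray, keypoints):
    """Sketch: RootSIFT concatenated with two L1-normalized poolings."""
    sift = cv2.SIFT_create()
    kps, des = sift.compute(gray, keypoints)

    # RootSIFT: L1-normalize each SIFT vector, then take the square root.
    des = np.sqrt(des / (des.sum(axis=1, keepdims=True) + 1e-7))

    # Zero-pad the image so 8x8 neighbourhoods at the border stay defined.
    padded = np.pad(gray.astype(np.float32), 4)

    out = []
    for kp, d in zip(kps, des):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        psi = padded[y:y + 8, x:x + 8]   # 8x8 neighbourhood Psi_{k_t}
        a1 = avg_pool(psi, 2)            # 2x2 filter, stride 1 -> 49 values
        a2 = avg_pool(psi, 3)            # 3x3 filter, stride 1 -> 36 values
        a1 /= a1.sum() + 1e-7            # L1-normalization
        a2 /= a2.sum() + 1e-7
        out.append(np.concatenate([d, a1, a2]))  # final descriptor
    return np.asarray(out, dtype=np.float32)
```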

In the ABM feature matching module, the Hamming distance previously used for comparing the A-KAZE binary descriptors must be replaced by another suitable metric. Our choice fell on the Bhattacharyya distance: given two keypoints \(k_t \in K_t\) and \(k_b \in K_{b_{t-1}}\) with descriptors \(\hat{d}_{k_t}\) and \(\hat{d}_{b_{t-1}}\), the similarity between \(k_t\) and \(k_b\) is measured as follows:

$$\begin{aligned} dist(k_t,k_b) = \sum _{i=0}^{n-1} \sqrt{\hat{d}_{k_t}(i)\hat{d}_{b_{t-1}}(i)}. \end{aligned}$$
(1)
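A direct implementation of Eq. (1) is straightforward; a minimal sketch in Python, assuming NumPy and descriptors stored as non-negative float arrays, is shown below. Note that, strictly speaking, this sum is the Bhattacharyya coefficient: it behaves as a similarity, so a matcher built on it should retain the pairs with the highest scores rather than the lowest distances.

```python
import numpy as np

def bhattacharyya_similarity(d_t, d_b):
    """Eq. (1): sum of the square roots of the element-wise products.

    For non-negative descriptors this is the Bhattacharyya coefficient;
    higher values indicate closer descriptors.
    """
    return float(np.sum(np.sqrt(d_t * d_b)))
```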

With these changes, our modified ABM method can handle the small variations caused by dynamic backgrounds, focusing only on significant movements, which are typically associated with foreground elements. In this way, robustness to the bootstrapping and moving camera problems is preserved, on par with the original A-KAZE descriptors.

3.3 Foreground Detection

As described in the introduction, a customized version of the foreground detection stage presented in [7] is integrated in this work to identify moving objects within scenes. For each frame \(f_t\) of the video sequence, this stage performs a change detection between the candidate foreground areas \(A_{f_t}\) and the background model. The final result is a binary image, called \(I_{mask_t}\), containing the mask of the foreground element silhouettes identified in \(f_t\). The change detection can be summarized in the following main steps:

  • the background model representation \(I_{b_t}\) (i.e., the image representing the reconstruction of the background at time t, without foreground elements) and \(f_t\) are converted to grayscale;

  • a mean filter of size \(3\times 3\) with a stride of 3 is applied to both \(I_{b_t}\) and \(f_t\);

  • the differences between the average values obtained by the filters at the same locations in \(I_{b_t}\) and \(f_t\) are computed;

  • for each \(3\times 3\) filter region, if the difference value exceeds the threshold \(\tau _1\) and some pixels of the region overlap a candidate foreground cluster \(a\in A_{f_t}\), then something has changed between \(f_t\) and \(I_{b_t}\) and further analysis must be performed in that region:

    • within the analysed \(3\times 3\) region, the difference between the pixel values of \(f_t\) and \(I_{b_t}\) is computed;

    • if the difference at position \((h,w)\) exceeds the threshold \(\tau _2\), then \(I_{mask_t}(h,w)=1\); otherwise, \(I_{mask_t}(h,w)=0\);

  • otherwise, for each pixel with coordinates \((h,w)\) included in the considered \(3\times 3\) region, \(I_{mask_t}(h,w) = 0\);

  • once all the regions have been analysed, an opening morphological operation is performed to reduce the noise in \(I_{mask_t}\).

The two thresholds \(\tau _1\) and \(\tau _2\) are set to 15 and 30, respectively, based on empirical tests performed on the FBMS-59 video sequences. Through this foreground detection pipeline, the proposed model can be effectively used to separate the background from the moving objects present in the scene, as sketched below.
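A minimal sketch of this change detection stage, assuming OpenCV and NumPy and a binary mask marking the candidate foreground areas, is given below; the \(3\times 3\) structuring element used for the final opening is an assumption.

```python
import cv2
import numpy as np

TAU1, TAU2 = 15, 30   # thresholds from the empirical FBMS-59 tests

def change_detection(frame, background, fg_regions):
    """Sketch of the block-wise change detection; `fg_regions` is a
    binary mask marking the candidate foreground areas A_{f_t}."""
    f = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.int16)
    b = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY).astype(np.int16)
    mask = np.zeros(f.shape, dtype=np.uint8)

    h, w = f.shape
    for i in range(0, h - 2, 3):       # 3x3 mean filter with stride 3
        for j in range(0, w - 2, 3):
            fb = f[i:i + 3, j:j + 3]
            bb = b[i:i + 3, j:j + 3]
            if (abs(fb.mean() - bb.mean()) > TAU1
                    and fg_regions[i:i + 3, j:j + 3].any()):
                # Pixel-level refinement inside the changed region
                # (255 plays the role of the 1 in I_{mask_t}).
                mask[i:i + 3, j:j + 3] = (np.abs(fb - bb) > TAU2) * 255

    # Morphological opening to reduce noise (3x3 kernel is an assumption).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```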

Fig. 2. Examples of moving object detection and background updating. The video sequences from column (a) to column (c) belong to the SBMnet dataset; the other sequences belong to the FBMS-59 dataset. For each column (from top to bottom), the first picture is the generic frame, the second the keypoint clustering stage, the third the moving object mask, and the last the background update. From the first column to the last, the following videos are shown: fall, canoe, pedestrians, cars4, dog, and farm01.

4 Experimental Results

This section reports the experimental results obtained on the SBMnet and FBMS-59 datasets. The first was used to prove the robustness of the proposed solution in dynamic background and very short scene reconstruction; the second was used to perform a comparison with selected key works of the current literature on the foreground detection task.

4.1 SBMnet Dataset

The SBMnet dataset provides several sets of videos focused on the following challenges: Basic, Intermittent Motion, Clutter, Jitter, Illumination Changes, Background Motion, Very Long, and Very Short. To show the improvements with respect to the previous ABM method, in this paper we focused mainly on the Background Motion and Very Short categories. Visual representations of some results, obtained on key videos of these two categories, are shown in Fig. 2, columns (a) to (c). It should be noted that the proposed method is able to distinguish the foreground (for example, the car or the canoe) from the movements performed by the dynamic background (i.e., the water and the trees blowing in the wind). In Tables 1 and 2, the experimental results and comparisons with some key works in the current literature on both selected categories are reported.

Based on the protocol proposed in [23], the following metrics were used: the Average Gray-level Error (AGE), the Percentage of Error Pixels (pEPs), the Percentage of Clustered Error Pixels (pCEPs), the Multi-Scale Structural Similarity Index (MS-SSIM), the Peak Signal-to-Noise Ratio (PSNR), and the Color image Quality Measure (CQM). Considering the Background Motion category, the proposed solution outperforms the previous ABM method in terms of the pEPs, PSNR, and MS-SSIM metrics. This means that our reconstruction contains few noisy points distributed within the whole image. For the Very Short category, we improve the performance on the AGE metric with respect to the ABM method, while the average measures of the other metrics are very close to the state-of-the-art results.
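For reference, the first two of these metrics are simple to compute; the sketch below illustrates AGE and pEPs on grayscale images, assuming the usual SBMnet error threshold of 20 for counting error pixels (an assumption; see [23] for the exact protocol).

```python
import numpy as np

def age_peps(gt, cb, thr=20):
    """AGE and pEPs between a ground-truth background `gt` and a
    computed background `cb`, both grayscale arrays; `thr=20` is an
    assumption based on the usual SBMnet protocol."""
    err = np.abs(gt.astype(np.float64) - cb.astype(np.float64))
    age = err.mean()              # Average Gray-level Error
    peps = (err > thr).mean()     # Percentage of Error Pixels
    return age, peps
```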

Table 1. Results and comparison of the proposed method on background reconstruction using the Background Motion category.
Table 2. Results and comparison of the proposed method on background reconstruction using the Very Short category.

4.2 FBMS-59 Dataset

The FBMS-59 dataset is an extensive benchmark for testing feature-based motion segmentation algorithms. Visual representations of the results, obtained on several video sequences, are shown in Fig. 2, columns (d) to (f). As can be observed, the proposed model accurately reconstructs the background and, then, isolates the foreground element silhouettes. In Table 3, comparisons in terms of precision, recall, and F1-measure between the proposed solution and key works of the current state-of-the-art are reported. In Table 4, additional results on video sequences not tested by the mentioned key works are also reported, to provide a further term of comparison for future works. Overall, the results highlight how our approach achieves good performance with respect to the ongoing literature, even outperforming the keypoint-based method proposed in [5]. The high recall values prove that our method is able to capture most of the foreground object pixels, making it suitable for person re-identification or surveillance applications. The latter, in particular, requires capturing the silhouettes of foreground objects, which may belong to intruders, as completely as possible, accepting false positives as a compromise.
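For completeness, a minimal sketch of how these three measures can be computed from binary foreground masks is shown below; this is an illustration, not the exact evaluation code of the FBMS-59 benchmark.

```python
import numpy as np

def precision_recall_f1(pred, gt):
    """Precision, recall, and F1-measure between binary foreground masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # true-positive pixel count
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```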

Table 3. Comparison with key works of the current literature on the FBMS-59 dataset.
Table 4. Additional experiments on FBMS-59 dataset.

5 Conclusion

This work presents an innovative combination of the RootSIFT descriptor and average pooling for real-time dynamic background modeling and foreground detection. Unlike the previous keypoint-based methods, the proposed solution is invariant to the small local changes caused by a dynamic background. The method provides remarkable results, with respect to different key works of the current state-of-the-art, in both background reconstruction and foreground detection challenges.