Abstract
Detecting relevant changes is a fundamental problem of video surveillance. Because of the high variability of data and the difficulty of properly annotating changes, unsupervised methods dominate the field. Arguably one of the most critical issues to make them practical is to reduce their false alarm rate. In this work, we develop a non-semantic, method-agnostic, weakly supervised a-contrario validation process, based on high-dimensional statistical modeling of deep features using a Gaussian mixture model, that can reduce the number of false alarms of any change detection algorithm. We also point out the insufficiency of the conventional pixel-wise evaluation, as it fails to precisely capture the performance needs of most real applications. For this reason, we complement pixel-wise metrics with component-wise metrics and evaluate the impact of our approach at both pixel and object levels, on six methods and several sequences from different datasets. Our experimental results reveal that the a-contrario theory can be applied to a statistical model of the background of a scene and largely reduce the number of false positives at both pixel and component levels.
1 Introduction
Video change detection is a fundamental problem in computer vision and the first step of many applications. While it is an easy task for humans in many contexts, it turns out to be very difficult to automate due to the wide range of possible scenarios. The goal is to assign a change label to those pixels whose photometric properties deviate from those of the background of the scene [1], providing a segmentation map of the temporal anomalies observed at each frame.
In the domains of security and surveillance, change detection can be used for spotting temporal anomalies such as suspicious individuals or stolen objects [2, 3]. In urban scenarios, it can be exploited to analyze common activities such as monitoring illegal parking of vehicles [4]. Change detection may also serve climate and humanitarian causes. Satellite image time series can be used to monitor urban development [5] of specific regions and the variability of gas concentrations in the atmosphere across time [6, 7].
Fig. 1 Sample results of the proposed a-contrario validation process for three different background subtraction algorithms. As shown, the proposed a-contrario validation is able to remove false positives of large and small sizes, while keeping true detections. The object-wise \(F_1\) score of the corresponding sequence is highlighted in each case
Traditional change detection methods work by learning a statistical model of a scene under normal conditions. This so-called background model is based on past samples [8, 9]. When a new frame is provided, it is compared to the background reference model, which can lead to a detection and/or to an update of this background model. Traditional pixel-wise methods are convenient because only a short local training is required, i.e., a statistical model of the scene is quickly built using a limited set of recent frames, keeping the computational complexity low. Nevertheless, these techniques are limited by the locality of the features, such as RGB pixels, which makes them prone to false alarms. To overcome this problem, deep learning has been used to leverage the ability of deep neural networks (DNNs) to learn suitable, high-level descriptors of a scene. Despite having shown improvements over traditional video change detection methods [10,11,12], such approaches are constrained by their supervised nature and suffer a substantial performance drop when tested on out-of-distribution data. Recently, some works have focused on exploiting the semantic information provided by DNNs in an unsupervised manner [13, 14]. These methods are more robust than classic approaches and do not require labeled examples. However, they are still sensitive to false positives in complex environments such as dynamic backgrounds or adverse weather conditions.
Decreasing the number of false positives in unsupervised methods is a high priority goal. Indeed, a significant number of false alarms may saturate a detection system or require human intervention, which is expensive and time-consuming. Change detection methods are conventionally evaluated on pixel-wise metrics, regardless of the spatial organization of faulty pixels: multiple small false detections are counted on par with a single false detection with equivalent area. As a result, pixel-wise scores may not realistically represent the performance of target applications. The number of false alarms is better evaluated at the object level than at the pixel level, because the cost of a false alarm is generally independent of its size. Hence, we shall favor object-wise performance metrics, where by object we understand a connected component.
In this work, we present a weakly supervised a-contrario validation process, based on high-dimensional modeling of deep features, to largely reduce the number of false positives at both the pixel and component levels. It is non-semantic, thus not limited by the known classes of a pretrained neural network. The contributions of our work are as follows:
-
We propose a method-agnostic, non-semantic, weakly supervised a-contrario validation process that can significantly reduce false alarms in video change detection. To the best of our knowledge, this is the first work to use the a-contrario framework at the DNN feature level for a detection problem.
-
We evaluate our work on six methods at both pixel and object levels. Furthermore, we test them on a set of sequences from different datasets, namely CDNet [1, 15], LASIESTA [16] and the sequences of J. Zhong and S. Sclaroff [17].
-
Our results show a considerable increase in object-wise performance metrics, while also improving or maintaining the pixel-wise results. Figure 1 illustrates this improvement.
2 Related Works
The detection of temporal anomalies in a video sequence or an image time series is known as change detection. It is difficult to find a definition of temporal anomalies that suits all cases. The methods in the literature look for changes with respect to previously observed examples that are semantically meaningful for the desired downstream task. Change detection algorithms in the literature can be categorized into traditional and neural network-based methods.
Traditional change detection. Traditional change detection methods use statistical computer vision techniques to model the background of a scene and update it online [18]. These approaches commonly follow a three-step workflow consisting of (1) building a background model of the scene, (2) comparing the newly observed frames to the background model and (3) updating the model accordingly. Background modeling consists in building a faithful probabilistic representation of the past, which is used as a reference for further observed examples. The seminal example of this approach is the adaptive Gaussian mixture model (GMM), first introduced by Stauffer and Grimson [19], which models each pixel with a mixture of K Gaussian distributions. Several modifications of their method have been proposed [20,21,22] to improve performance and efficiency. Later methods proposed to model the background using a buffer of past samples, which can alleviate computational complexity. New samples are compared against the stored examples based on a consensus. The popular unsupervised methods ViBe [8] and SuBSENSE [9] use such consensus-based algorithms. During inference, a new unseen frame is compared to the generated background model using an error metric that maps pixels to either background or foreground clusters, producing a binary mask encoding this two-level information.
Deep learning-based change detection methods. More recent works have exploited current deep learning algorithms to replace one or more steps of the traditional flow. Braham and Van Droogenbroeck [10] show that the complex background modeling task can be simplified by training a CNN with scene-specific examples. An autoencoder-based architecture called FgSegNet is proposed in [23], which adapts a VGG-16 [24] architecture into a triplet framework, processing images at three different scales. Tezcan et al. [25] proposed BSUV-Net, which trains a scene-agnostic network so that it can be tested on new, unseen scenes without individually fine-tuning the network. A newer version of their approach, BSUV-Net 2.0 [26], was later proposed. The ability of DNNs to learn suitable, high-level descriptors of a scene has proved to yield better results than traditional approaches [10,11,12]. Nevertheless, supervised methods require large amounts of annotated data, a tedious and time-consuming task. Furthermore, the performance of supervised methods often declines on out-of-domain examples. Consequently, unsupervised methods are often chosen over recent supervised methods [27, 28]. For this reason, several recent works have focused on leveraging DNN high-level representations without supervision. Braham et al. [13] proposed SemanticBGS, where a classic method is complemented with semantic information provided by a pretrained network. Moreover, a real-time version of the same approach, named RT-SemanticBGS [29], was later introduced. G-LBM, introduced by Rezaei et al. [30], models the background of a scene with a generative adversarial network (GAN) in the presence of noise and sparse outliers. More recently, An et al. [31] introduced zero-shot background subtraction (ZBS), a method that leverages recent advances in zero-shot object detection to build an open-vocabulary instance-level background model via CLIP [32] embeddings.
Unsupervised DNN-based methods achieve better performances than traditional approaches. Nonetheless, they still fall behind supervised methods in popular benchmarks. Similarly to classic methods, these techniques may detect a substantial number of false alarms, which can critically saturate detection systems.
The remote sensing community has a particular interest in the change detection problem. In this case, a sequence of satellite images of the same scene is provided, with the objective of discriminating relevant changes over time. The definition of change depends on the target application: some datasets and methods focus on land-cover or semantic changes [33,34,35,36], while others make no presupposition about the nature of the changes [37]. This distinction often arises because the semantics of high-resolution images are easily identified, whereas in low-resolution images the observed changes often consist of a few pixels with an unclear semantic meaning. In remote sensing, the number of available images is limited, and the revisit time of satellites leads to a large temporal gap between acquisitions (often days or even weeks). For this reason, change detection in remote sensing is generally approached as an extreme case: short sequences with a very low frame rate.
Post-processing techniques to reduce false positives have been thoroughly discussed in the literature of video background subtraction and change detection. Identifying and removing such cases is considered necessary in order to reduce the visual noise introduced by unavoidable factors such as background movements, light changes and artifacts [38]. Initial approaches relied on local features to correct the results of any classical approach [38,39,40], e.g., shape regularity, color, motion differences, texture, etc. Wang et al. [41] proposed to train a linear classifier on a set of hand-crafted features to discriminate false positives from true positives. Lin et al. [42] later introduced residual background networks (ResBGNets), improving the results of existing methods by learning, with a CNN, the residual images between the results of existing methods and the ground truth. While removing wrongly detected areas is considered a crucial process to avoid system saturation, some works have also focused on additionally recovering undetected segments, i.e., extending segmented foreground pixels to undetected foreground object areas [43]. In addition, some works have focused on how to evaluate the provided segmentation binary maps, and developed alternative metrics to reliably assess the quality of these predictions [44,45,46].
Statistical modeling for video scene understanding. Surveillance applications need to distinguish foreground from background elements so that target instances can be further processed. Early methods attempted to statistically model appearance information with parametric models such as GMMs [19,20,21,22]. However, background modeling in complex scenarios (e.g., dynamic backgrounds or adverse weather conditions) has proved to be challenging for those approaches. Other works introduce optical flow to understand the motion patterns of a scene. For example, Saleemi et al. [47] proposed to model the motion patterns with a mixture of Gaussians using optical flow. Similarly, Ghahremannezhad et al. [48] proposed a method for real-time foreground segmentation modeling optical flow with a GMM. Some recent works have proposed to model the feature space of deep neural networks for image anomaly detection. PaDiM, proposed by Defard et al. [49], is a framework that models patches of a DNN feature map with a Gaussian model for anomaly detection and localization. Artola et al. [50] generalized this attempt with GLAD, a method that learns a robust GMM globally, and then localizes the learned Gaussians with a spatial weight map. Modeling spatial features in a high-dimensional space has shown promising results at recognizing complex patterns.
A-contrario detection theory. The a-contrario detection theory is a mathematical formulation of the non-accidentalness principle, which states that an observed structure is meaningful only when the relation between its parts is too regular to be the result of an accidental arrangement of independent parts [51, 52]. The a-contrario methodology [51, 52] allows one to control the number of false alarms by considering an observed structure only when the expectation of its occurrences is small in a stochastic background model. The number of false alarms (NFA) of an event e observed up to a precision z(e) in the background model \(\mathcal {H}_0\) is defined by
$$\textrm{NFA}(e) = N_T \cdot \mathbb {P}\big [Z_{\mathcal {H}_0}(e) \ge z(e)\big ],$$
where \(\mathbb {P}[Z_{\mathcal {H}_0}(e) \ge z(e)]\) is the probability of obtaining a precision \(Z_{\mathcal {H}_0}(e)\) better than or equal to the observed one z(e) in the background model \(\mathcal {H}_0\). The term \(N_T\) corresponds to the number of tests, following the statistical multiple hypothesis testing framework [53]. A small NFA indicates that the event e is unlikely to be randomly observed in the background model \(\mathcal {H}_0\). Hence, the lower the NFA, the more meaningful the event. A value \(\epsilon \) is specified and candidates with \(\textrm{NFA} < \epsilon \) are accepted as valid detections. It can be shown [51] that in these conditions \(\epsilon \) is an upper-bound to the expected number of false detections under \(\mathcal {H}_0\).
A-contrario methods have been previously proposed in the literature to address computer vision problems. Lisani and Ramis [54] applied an a-contrario methodology on a normal distribution for the detection of faces in images. In surveillance, a-contrario methods have been used mainly in the remote sensing field [5, 55, 56], where the temporal difference between images is large and no tracking of temporal objects is feasible. Grompone et al. [5] proposed an a-contrario method based on a uniform distribution and a greedy algorithm to compute candidate regions, detecting visible ground areas in satellite imagery. Tailanian et al. [57] proposed to control the number of false alarms in image anomaly detection by applying an a-contrario strategy on anomaly maps generated by a multi-scale transformer architecture. Recently, Ciocarlan et al. [58, 59] introduced an a-contrario criterion in the neural network training loop, considering the unexpectedness of an object during training based on the NFA. While these works shed some light on how to integrate the a-contrario theory with neural networks, they only detect a particular object class, and certain assumptions are made, e.g., on the background model distribution and the independence of tests.
3 Pixel and Object-wise Evaluation
Change detection algorithms tend to be evaluated with pixel-wise metrics. While this approach allows one to assess how well methods classify pixels into foreground or background clusters, it often fails to represent the performance needs of real applications. Detection systems today tend to consider detections as sets of connected components instead of independent pixels, and then process each detection separately for further analysis. An algorithm with high pixel-wise evaluation scores might still predict a considerable number of false alarms at the object level, which can lead to bottlenecks in the system. Focusing on the performance at the object level provides a more accurate account of the usability of methods for surveillance applications. Hence, reducing false positives at the object level is key to increasing processing speed and avoiding system saturation.
Consequently, we consider both pixel-wise and object-wise evaluation metrics to analyze the performance of our work and existing algorithms. We emphasize our evaluation on the reduction of false alarms and the accuracy of the detections.
Pixel-wise metrics. Let tn, tp, fn, fp be the usual pixel-wise numbers of true negatives, true positives, false negatives and false positives, respectively. Our experiments consider the following pixel-wise metrics (a computation sketch follows the list):
-
Precision: \(\textsc {pr} ^{pi} = \textsc {tp}/ (\textsc {tp} + \textsc {fp})\)
-
Recall: \(\textsc {re} ^{pi} = \textsc {tp}/ (\textsc {tp} +\textsc {fn})\)
-
False Positive Rate: \(\textsc {fpr} ^{pi} = \textsc {fp}/(\textsc {fp} +\textsc {tn})\)
-
Percentage of Wrong Classifications: \(\textsc {pwc} ^{pi} = 100 \times (\textsc {fn} +\textsc {fp}) / (\textsc {tp} +\textsc {fn} +\textsc {fp} +\textsc {tn})\)
-
F-measure: f \(_1\) \(^{pi}=2 (\textsc {pr} ^{pi} \times \textsc {re} ^{pi}) / (\textsc {pr} ^{pi} + \textsc {re} ^{pi})\).
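For concreteness, here is a minimal sketch of how these pixel-wise scores can be computed from a pair of binary masks; the function name and the guards against empty denominators are our own choices, not part of the original evaluation code.

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-wise scores from boolean change masks (True = change)."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    pr = tp / max(tp + fp, 1)               # precision
    re = tp / max(tp + fn, 1)               # recall
    fpr = fp / max(fp + tn, 1)              # false positive rate
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)
    f1 = 2 * pr * re / max(pr + re, 1e-12)  # F-measure
    return {"pr": pr, "re": re, "fpr": fpr, "pwc": pwc, "f1": f1}
```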
Object-wise metrics. We also evaluate the results at the object level, where \(\textsc {tp} ^{ob}\), \(\textsc {fn} ^{ob}\) and \(\textsc {fp} ^{ob}\) now correspond to true positives, false negatives and false positives for sets of connected components. To define our metrics, we consider two perspectives. First, we evaluate the spatial alignment between predictions and ground truth using the traditional intersection over union (IoU). However, such a measurement may treat as faulty those detections that do not sufficiently align spatially with the ground truth. While this is valid for most cases, the nature of the target application may not allow us to miss any detection, even if the detected blob differs considerably from the ground truth. Hence, we propose two sets of metrics to evaluate our results at the object level.
On the one hand, we take the approach of Chan et al. [46] and use a variation of the traditional intersection over union (IoU) first introduced by Rottmann et al. [45]. Unlike the conventional IoU, which penalizes cases where a ground truth region is fragmented into multiple predictions by assigning each prediction a moderate IoU score, the adapted metric, named sIoU, does not penalize predictions of a segment when the remaining ground truth is sufficiently covered by other predicted segments.
More formally, let \(\mathcal {K}\) be the set of anomalous components in the ground truth, and \(\hat{\mathcal {K}}\) the set of anomalous components predicted by a change detection algorithm. The sIoU metric is then a mapping \(sIoU: \mathcal {K} \rightarrow [0, 1]\) defined for \(k \in \mathcal {K}\) by
$$sIoU(k) = \frac{|k \cap \hat{\mathcal {K}}(k)|}{|(k \cup \hat{\mathcal {K}}(k)) \setminus \mathcal {A}(k)|}, \qquad \hat{\mathcal {K}}(k) = \bigcup _{\hat{k} \in \hat{\mathcal {K}},\ \hat{k} \cap k \ne \emptyset } \hat{k},$$
where \(\mathcal {A}(k) = \{z\in k^{\prime }: k^{\prime } \in \mathcal {K} \backslash \{k\}\}\). The introduction of \(\mathcal {A}(k)\) excludes all pixels from the union if and only if they correctly intersect with another ground truth component. Hence, given a threshold \(\tau \in [0, 1)\), we define a target \(k\in \mathcal {K}\) as \(\textsc {tp} ^{ob}\) if \(sIoU(k) > \tau \), and as \(\textsc {fn} ^{ob}\) otherwise. Then, \(\textsc {fp} ^{ob}\) is determined by the positive predictive value (PPV) of each \(\hat{k}\in \hat{\mathcal {K}}\), defined as
$$PPV(\hat{k}) = \frac{\big |\hat{k} \cap \bigcup _{k \in \mathcal {K}} k\big |}{|\hat{k}|}.$$
Thus, \(\hat{k}\in \hat{\mathcal {K}}\) is \(\textsc {fp} ^{ob}\) if \(PPV(\hat{k}) \le \tau \). Lastly, the sIoU-based F-measure is computed as follows:
-
F-measure: f \(_1\) \(^{sIoU}(\tau )=\frac{2\cdot \textsc {tp} ^{ob}(\tau )}{2\cdot \textsc {tp} ^{ob}(\tau )+\textsc {fn} ^{ob}(\tau )+\textsc {fp} ^{ob}(\tau )}\)
We follow the approach of Chan et al. and average the results for different thresholds \(\tau = \{0.25, 0.5, 0.75\}\).
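The following sketch illustrates one plausible implementation of the per-component sIoU under these definitions; the connectivity (scipy's default 4-connectivity) and the handling of empty unions are our own assumptions.

```python
import numpy as np
from scipy import ndimage

def siou_scores(gt: np.ndarray, pred: np.ndarray) -> list:
    """sIoU of each ground-truth component.
    gt, pred: boolean change masks of the same shape."""
    gt_lab, n_gt = ndimage.label(gt)
    pred_lab, _ = ndimage.label(pred)
    scores = []
    for k in range(1, n_gt + 1):
        k_mask = gt_lab == k
        a_k = gt & ~k_mask                       # A(k): other gt components
        hit = np.unique(pred_lab[k_mask])        # predictions touching k
        k_hat = np.isin(pred_lab, hit[hit > 0])  # their union, K_hat(k)
        inter = np.sum(k_mask & k_hat)
        union = np.sum((k_mask | k_hat) & ~a_k)  # union without A(k)
        scores.append(inter / union if union else 0.0)
    return scores
```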
On the other hand, we adapt the pixel-level metrics defined at the beginning of the section to the object level, where any detection that overlaps by at least one pixel with the ground truth is considered a good detection. While we previously defined \(\textsc {tp} ^{ob}\), \(\textsc {fn} ^{ob}\) and \(\textsc {fp} ^{ob}\), true negatives at the component level lack a clear meaning. Instead, we compute the fpr relative to the number of frames \(n_f\). Rather than selecting \(\textsc {tp} ^{ob}\), \(\textsc {fn} ^{ob}\) and \(\textsc {fp} ^{ob}\) by IoU thresholding, we define \(\textsc {tp} ^{ob}\) as detected regions containing at least one real positive pixel. Then, \(\textsc {fp} ^{ob}\) are detections that do not overlap with any real positive pixels, and \(\textsc {fn} ^{ob}\) are regions that should have been detected but where no change was predicted by the method. To avoid the fragmentation issue, we mark the entire ground truth region of each true positive detection as already checked. Thus, we define the object-wise metrics as follows (a counting sketch is given after the list):
-
\(\textsc {pr} ^{ob} = \textsc {tp} ^{ob} / (\textsc {tp} ^{ob} + \textsc {fp} ^{ob})\)
-
\(\textsc {re} ^{ob} = \textsc {tp} ^{ob} / (\textsc {tp} ^{ob}+\textsc {fn} ^{ob})\)
-
\(\textsc {fpr} ^{ob} = \textsc {fp} ^{ob}/n_f\)
-
\(\textsc {pwc} ^{ob} = 100 \times (\textsc {fn} ^{ob}+\textsc {fp} ^{ob}) / (\textsc {tp} ^{ob}+\textsc {fn} ^{ob}+\textsc {fp} ^{ob})\)
-
f \(_1\) \(^{ob} = 2 (\textsc {pr} ^{ob} \times \textsc {re} ^{ob}) / (\textsc {pr} ^{ob} + \textsc {re} ^{ob})\).
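The counting rule can be sketched as follows; the exact handling of fragmented detections is our reading of the procedure above, not the authors' reference implementation.

```python
import numpy as np
from scipy import ndimage

def object_counts(pred: np.ndarray, gt: np.ndarray):
    """Object-level TP/FP/FN with one-pixel-overlap matching."""
    pred_lab, n_p = ndimage.label(pred)
    gt_lab, n_g = ndimage.label(gt)
    fp = 0
    hit = np.zeros(n_g + 1, dtype=bool)
    for i in range(1, n_p + 1):
        ids = np.unique(gt_lab[pred_lab == i])
        ids = ids[ids > 0]
        if ids.size == 0:
            fp += 1          # no real positive pixel under this detection
        else:
            hit[ids] = True  # mark covered gt regions as already checked
    tp = int(hit.sum())      # gt regions detected at least once
    fn = n_g - tp            # gt regions missed entirely
    return tp, fp, fn
```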
For readability, and since we prioritize the reduction of false positives, the main text tables report the following pixel and object metrics: \(\textsc {fpr} ^{pi}\), \(\textsc {pwc} ^{pi}\), f \(_1\) \(^{pi}\), sIoU, f \(_1\) \(^{sIoU}\) and \(\textsc {fpr} ^{ob}\). Reporting all defined metrics would be redundant; they were nevertheless computed for completeness, and their raw values are provided in the appendix.
4 Our Approach
We propose to supplement any change detection method from the literature with a final a-contrario validation step applied on connected components. Such validation is based on a statistical model of the DNN representations of the scene and requires no annotation. The feature representations can be obtained by a pretrained CNN used as backbone (see the experimental Sect. 5). Hence, we first extract such deep representations of our scene at one or more stages of the network. Then, we build a background model of the scene by learning a Gaussian mixture in a global-to-local manner. Lastly, we develop an a-contrario validation process to control the number of false alarms at both the pixel and object levels, using the learned model and a preexisting change detection algorithm. In the following sections, we present in detail the general detection framework, which is independent of the choice of the pretrained backbone. Figure 2 shows a high-level diagram of the proposed approach.
Fig. 2 The proposed deep feature a-contrario validation extracts the representations given by a pretrained network and models them with a global-to-local mixture of Gaussians. For a new, unseen frame, it computes the probability map of observing a temporal anomaly using the trained GMM. Given a change mask predicted by any algorithm in the literature, a validation process based on the a-contrario theory assesses all detected regions and removes false alarms
4.1 Background Feature Modeling
We model the extracted deep representations with a mixture of Gaussians to assess the likelihood of an image patch being part of the background. For that, we extend the GLAD [50] framework to the background modeling of videos. This requires no dense annotations, only a selection of training frames with no or few anomalies present; we therefore label our approach as weakly supervised. A mixture is first learned globally, i.e., without taking into consideration the spatial location of the data points. This yields a first Gaussian mixture model \(\theta = (\phi _i, \mu _i, \Sigma _i)_{i \in \{1,\dots , K\}}\), where \(\mu _i\) and \(\Sigma _i\) are the mean and covariance of each component, while \(\phi _i\) are the mixture weights. Then, a local model is derived by assigning position-dependent weights to each Gaussian, so that an image position is represented by a local mixture of the most relevant Gaussian distributions. This gives a localized model that depends on the pixel position (x, y), such that \(\theta (x,y) = (\phi _i(x,y), \mu _i, \Sigma _i)_{i \in \{1,\dots , K\}}\), where \(\mu _i\) and \(\Sigma _i\) do not depend on the position (x, y). This global-to-local approach enables one to exploit information from other similar pixels and to build a good representation of each observed pixel.
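A minimal sketch of this global-to-local scheme is given below. It fits a global GMM on all feature vectors and derives per-position weights from average responsibilities; the component count and the responsibility averaging are simplifying assumptions, not GLAD's exact pruning and localization procedure (the paper starts from K=1000 and lets GLAD prune unnecessary Gaussians).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_global_to_local(feats: np.ndarray, K: int = 50):
    """feats: training features of shape (n_frames, H, W, D), e.g., taken
    from a ResNet stage. Returns the global GMM and per-position weights
    phi_i(x, y) of shape (H, W, K)."""
    n, h, w, d = feats.shape
    gmm = GaussianMixture(n_components=K, covariance_type="full")
    gmm.fit(feats.reshape(-1, d))                    # global model
    resp = gmm.predict_proba(feats.reshape(-1, d))   # responsibilities
    local_w = resp.reshape(n, h, w, K).mean(axis=0)  # localized weights
    return gmm, local_w
```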
Definition 1
The probability of observing \({\textbf {p}}\) at position (x, y) is
$$\mathbb {P}\big ({\textbf {p}}~|~\theta (x,y)\big ) = \sum _{i=1}^{K} \phi _i(x,y)\, \mathbb {P}({\textbf {p}}~|~\mu _i, \Sigma _i),$$
where only the weights \(\phi _i(x,y)\) depend on the position. Then, the \(p{-}\textrm{value}\) of a given pixel \({\textbf {p}}\) at position (x, y) is defined by
$$p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big ) = \int _{\mathcal {D}({\textbf {p}}|\theta (x,y))} \mathbb {P}\big ({\textbf {q}}~|~\theta (x,y)\big )\, d{\textbf {q}}, \qquad (5)$$
where \(\mathcal {D}({\textbf {p}}|\theta (x,y)) =\big \{{\textbf {q}}~|~\mathbb {P}({\textbf {q}}|\theta (x,y)) \le \mathbb {P}({\textbf {p}}|\theta (x,y))\big \}\).
This quantity cannot be easily computed, but an upper-bound can be derived. We can interchange the sum of the GMM and the integral of the p-value to get K Gaussian integrals. However, the set \(\mathcal {D}\) cannot be computed, so we introduce \(\mathcal {D}_i({\textbf {p}}|\theta (x,y)) =\big \{{\textbf {q}}~|~ \phi _i(x,y) \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \le \mathbb {P}({\textbf {p}}|\theta (x,y)) \big \}\), which contains it (\(\mathcal {D}\subseteq \mathcal {D}_i\)). We thus find ourselves with the upper-bound
$$p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big ) \le \sum _{i=1}^{K} \phi _i(x,y) \int _{\mathcal {D}_i({\textbf {p}}|\theta (x,y))} \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i)\, d{\textbf {q}}.$$
These integrals are equivalent to the \(\chi ^2\) survival function [60]; in the case where the features are of even dimension they are equal to a finite sum that can be computed exactly, as stated in the next proposition.
Proposition 1
Consider a mixture of classical Gaussian distributions
and define the p value as the integral of the density where the probability is lower than the probability density \(\mathbb {P}({\textbf {p}}|\theta )\),
where \(\mathcal {D}({\textbf {p}}|\theta ) =\big \{{\textbf {q}}~|~\mathbb {P}({\textbf {q}}|\theta )\le \mathbb {P}({\textbf {p}}|\theta )\big \}\). Set \(\mathcal {D}_i({\textbf {p}}|\theta ) =\big \{{\textbf {q}}~|~ \phi _i \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \le \mathbb {P}({\textbf {p}}|\theta ) \big \}\). Consider the upper-bound of the p value obtained by replacing \(\mathcal {D}\) by the corresponding \(\mathcal {D}_i\) in the integrals, i.e.,
Then, we have
for even m. For odd m, we have
Proof
We first rewrite the condition for a feature to be included in \(\mathcal {D}_i\), showing that this is the outside area of an ellipsoid characterized by \(R_i^2\) as defined below:
$$\phi _i\, \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \le \mathbb {P}({\textbf {p}}|\theta ) \iff ({\textbf {q}}-\mu _i)^\top \Sigma _i^{-1}({\textbf {q}}-\mu _i) \ge R_i^2, \quad \text {with } R_i^2 = -2\log \left( \frac{(2\pi )^{d/2}\, |\Sigma _i|^{1/2}}{\phi _i}\, \mathbb {P}({\textbf {p}}|\theta )\right) .$$
We then introduce two changes of variables to facilitate the integral. The first one is a normalization of the Gaussian, \(u=\Sigma _i^{-1/2}(p-\mu _i)\) with the determinant of the Jacobian \(|J|=\sqrt{|\Sigma _i|}\), so that
Subsequently, we shift to hyper-spherical coordinates through another change of variables:
The determinant of the Jacobian of this change is \(|J|=r^{d-1}\prod _{k=1}^{d-2}\sin ^{d-k-1}\theta _k\). Thus, we have the integral
We first integrate the angles
Then, we recognize the probability density of \(\chi ^2\) in the integral and therefore the whole is the survival function of \(\chi ^2\), such that
The r integral can be computed via integration by parts, and the result depends on whether the dimension d of the features is even or odd. We will first see the even case \(d=2m\):
We proceed in the same way for the odd case \(d=2m+1\):
The advantage of the even case is that it takes the form of a finite sum that is simple to compute, whereas the odd case requires estimating the error function. We get the complete formula (9) of the even case by combining (15), (16), (17), and the complete formula (10) of the odd case by combining (15), (16), (18). \(\square \)
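As a sanity check, the upper bound of Proposition 1 can also be evaluated numerically through the \(\chi ^2\) survival function. The sketch below is a direct, unoptimized implementation for one feature vector; it calls scipy's survival function instead of the closed-form finite sum of the even case, and a log-domain computation would be needed for numerical robustness in high dimension.

```python
import numpy as np
from scipy.stats import chi2

def pvalue_upper_bound(p, phis, mus, sigmas):
    """Upper bound of the p-value for feature p (shape (d,)) under a GMM
    with weights phis (K,), means mus (K, d), covariances sigmas (K, d, d)."""
    d = p.shape[0]
    dens, log_norms = 0.0, []
    for phi, mu, sig in zip(phis, mus, sigmas):
        diff = p - mu
        m2 = diff @ np.linalg.inv(sig) @ diff            # Mahalanobis^2
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(sig)[1])
        dens += phi * np.exp(log_norm - 0.5 * m2)        # mixture density at p
        log_norms.append(log_norm)
    bound = 0.0
    for phi, log_norm in zip(phis, log_norms):
        # R_i^2 from the ellipsoid condition phi_i * N(q) <= P(p | theta);
        # the i-th Gaussian integral equals the chi2_d survival at R_i^2.
        r2 = -2.0 * (np.log(dens) - np.log(phi) - log_norm)
        bound += phi * chi2.sf(max(r2, 0.0), df=d)
    return min(bound, 1.0)
```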
4.2 A-contrario Validation
We propose an a-contrario method designed to control the number of false alarms of any change detection method. A change candidate is only meaningful when the expectation of all its independent parts is low. In our case, we define the background model \(\mathcal {H}_0\) as the local Gaussian mixture \(\theta \) learned during training. Thus, we define the number of false alarms \(\textrm{NFA}(R)\) over a region R as
$$\textrm{NFA}(R) = N_T \cdot \mathbb {P}\big [Z_{\theta } \ge z(R)\big ], \qquad (19)$$
where z(R) measures how anomalous the observed values in region R are and \(N_T\) is the overall number of potential anomaly regions.
Remark 1
To understand the rationale of this definition of the NFA, recall that if a value \(\epsilon \) is specified and candidates with \(\textrm{NFA} < \epsilon \) are accepted as valid detections, then it can be shown [51] that \(\epsilon \) is an upper-bound to the expected number of false detections under \(\mathcal {H}_0\).
The corresponding random variable \(Z_{\theta }\) is a random vector with the same dimension as the number of pixels of R. Then, assuming pixel independence (see below), we can measure the anomaly relative to a Gaussian mixture \(\theta \) by computing the probability term of (19) as
$$\mathbb {P}\big [Z_{\theta } \ge z(R)\big ] = \prod _{{\textbf {p}} \in R} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big ), \qquad (20)$$
where \(p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big )\), given by Equation (5), is evaluated on each pixel \({\textbf {p}}\) of the region R.
We need to define the number of tests \(N_T\) to complete the NFA formulation (19). \(N_T\) is related to the total number of candidate regions that can, in theory, be considered for evaluation. Inspired by the approach in [5], we consider regions of any shape formed by 4-connected pixels. Regions of pixels with 4-connectivity are known as polyominoes [53, 61]. The exact number \(b_n\) of different polyomino configurations of a given size n is not known in general; however, a good estimate [62] is given by \(b_n \approx \alpha \frac{\beta ^n}{n}\), where \(\alpha \approx 0.317\) and \(\beta \approx 4.06\). Additionally, we need to consider that any pixel in the image can be the center of a region and that a region can be of size from 1 to XY, where X and Y are the width and height of the image, respectively. Thus, we can define the number of tests \(N_T\) as
$$N_T = XY \cdot b_n \approx XY\, \alpha \frac{\beta ^{n}}{n}, \qquad (21)$$
where \(n=|R|\) is the size of the region R. Notice that this is not exact, as Equation (21) allows for some potential polyominoes extending outside of the image boundaries, but it is an approximation of the same magnitude.
The a-contrario theory states that a structure is meaningful when the relation between its parts is too regular to be the outcome of an accidental arrangement of independent parts. However, feature map pixels given by a neural network are not completely independent: a feature vector is not independent of its neighboring vectors within the receptive field. Hence, nearby pixels do not guarantee the independence criterion. A solution would be to perform the a-contrario validation considering only pixels far enough apart to fulfill the independence criterion. One can visualize this as selecting grids of points that are independent inside the region. Naturally, there are several possible grids assuring independence for a given region. Indeed, if \(c_f\) is the size of the receptive field, there are \(c_f\) such grids, and all of them should be considered. A more practical solution, however, is to consider all the pixels of the region and then take the \(c_f\)-th root; this computes the geometric mean of the probability terms of all those grids. This is a conservative estimate, as there is certainly one of those grids with a probability term better than or equal to the mean.
All in all, Equation (19) can be expressed as
$$\textrm{NFA}(R) = XY\, \alpha \frac{\beta ^{n}}{n} \left( \prod _{{\textbf {p}} \in R} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big )\right) ^{1/c_f},$$
where \(c_f=35\) for stage 1 and \(c_f=91\) for stage 2, according to their receptive fields.
A region R with \(\textrm{NFA}(R) < \epsilon \) is declared a change detection. Since the \(\textrm{NFA}\) can take very extreme values, \(\log (\textrm{NFA})\) is often easier to handle.
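The validation score thus reduces to a sum of logarithms. A minimal sketch for one connected component follows; the per-pixel p-values are assumed to come from the upper bound above, and the extraction of 4-connected regions from the change mask is omitted.

```python
import numpy as np

ALPHA, BETA = 0.317, 4.06  # polyomino count estimate b_n ~ alpha * beta^n / n

def log10_nfa(region_pvalues, image_h, image_w, c_f):
    """log10 NFA of a detected region (cf. Eqs. (19) and (21)).
    region_pvalues: per-pixel p-values inside the region;
    c_f: receptive-field size for the geometric-mean correction."""
    n = len(region_pvalues)
    log_nt = (np.log10(image_h * image_w) + np.log10(ALPHA)
              + n * np.log10(BETA) - np.log10(n))
    log_prob = np.sum(np.log10(region_pvalues)) / c_f
    return log_nt + log_prob

# A region is kept when log10_nfa(...) < log10(epsilon) = 0 for epsilon = 1.
```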
To handle non-independence, we chose the simplest option: subsampling the image. More sophisticated alternatives have been proposed, such as introducing a first-order Markov chain into the a-contrario model to account for dependencies between neighboring pixels [63]. In [63], however, the Markov chain is used for line segment detection: detection probabilities can be computed by dynamic programming because the pixels are aligned on each tested segment. This does not transfer to regions, which would require modeling 2D Markov random fields (MRFs) as a-contrario models; even the core paper Beyond Independence: An Extension of the A-contrario Decision Procedure by Myaskouvskey et al. [64] does not handle 2D MRFs. To our knowledge, the question of how to compute probabilities of false alarms in a 2D MRF remains open. Furthermore, in our case, an a-contrario MRF would have to be learned from the descriptors observed when the detection network is applied to a white-noise input; it would thus differ for each detection network, whose connectivity defines the underlying cliques. Finally, estimating the parameters of such an MRF would require simplifying assumptions (e.g., Gaussianity) for which we have no particular argument, and which would again depend on the underlying detection method. This endeavor goes well beyond the scope of this paper, as it would lead to the question of modeling stochastic a-contrario random fields for all convolutional networks. This is why we rely on an empirical independence assumption to evaluate the p-values.
5 Experiments
Given a set of input frames of a sequence, we obtain its feature representations at stages 1 and 2 of a pretrained ResNet-50 [65] architecture. We use a backbone pretrained on ImageNet [66] with self-supervision using the VicReg method [67].
Implementation details. We trained a GMM on each sequence, setting \(K=1000\) initial Gaussian distributions, and let GLAD remove unnecessary Gaussians, as proposed by its authors [50]. For training, we selected the first few hundred consecutive frames of each scene without any obvious anomalies. For the sequences where anomalies are continuously present (e.g., cars driving by a road), a temporal median filter was applied to filter them out. For the scenes where the first frames contain only a few outliers, we kept them in the training set. The a-contrario validation was applied with a threshold \(\epsilon =1\), and we computed the geometric mean of the NFA scores of each stage. For simplicity, the sequences and their ground truths were resized to 256\(\times \)256.
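For the continuously busy sequences, the temporal median filtering can be sketched as follows; the window length and the use of a sliding median are assumptions about the preprocessing, not the authors' exact settings.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_filtered_frames(frames: np.ndarray) -> np.ndarray:
    """frames: (n_frames, H, W, 3) uint8 training frames with transient
    objects. A sliding temporal median (window of 9 frames, axis 0 = time)
    suppresses objects that are only occasionally present, e.g., passing
    cars, before fitting the GMM."""
    return median_filter(frames, size=(9, 1, 1, 1))
```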
Results on the CDNet benchmark. CDNet [1, 15] is a benchmark for video change detection consisting of 53 videos divided into 11 categories. Each category corresponds to a specific challenge such as shadow or lowFramerate. It provides pixel-wise annotations for all frames except the first few hundred, which are used for initialization. We apply the proposed approach on all categories of the dataset except cameraJitter and PTZ, as our method assumes that the camera is static and these categories contain variations of the point of view. We evaluated the impact of our approach on a range of methods for the pixel and object-wise metrics introduced in Sect. 3. The selected methods were ViBe [8], SuBSENSE [9], SemanticBGS [13], BSUV-Net [25], BSUVFPM-Net [25] and BSUV-Net 2.0 [26]. The classic ViBe and SuBSENSE are still popular and remain part of the state of the art. SemanticBGS combines classical methods with neural networks, and the BSUV family are supervised, scene-agnostic methods (though extracting appropriate features from BSUV is difficult due to their reliance on skip connections). Table 1 shows the relative change percentage when applying the proposed a-contrario validation compared to the methods alone on the CDNet dataset (except cameraJitter and PTZ). Furthermore, we selected two categories that are substantially prone to false alarms, dynamic background and bad weather, and show their results separately in Tables 2 and 3. Notice that we provide the results for each algorithm after applying a 5 by 5 median filter to remove noise and very small false positives, as done by the CDNet authors. For tables containing the exact metric scores, we refer to the Appendix. Formally, we denote by \(S_b\) and \(S_{a}\) the metric scores before and after the a-contrario validation, respectively. We then compute the percentage improvement as
$$\Delta = 100 \times \frac{S_a - S_b}{S_b}.$$
As observed, our approach largely improves the object-wise metrics for all methods, while maintaining or improving their pixel-wise counterparts. Notice that in some cases, such as for ViBe, the percentage improvement is surprisingly large. This occurs as a consequence of removing all FPs that correspond to isolated pixels, as the method commonly yields significant segmentation noise in complex scenes. Figure 3 shows examples of the achieved qualitative results for different sequences and methods. Lastly, the proposed method proved to be robust to anomalies in the training data, yielding weak clusters and low probabilities for such occurrences, as shown by the cases where a few anomalies were present in the training set.
Results on LASIESTA sequences. The LASIESTA (Labeled and Annotated Sequences for Integral Evaluation of SegmenTation Algorithms) dataset [16] is a fully annotated benchmark for change detection and foreground segmentation proposed by Cuevas et al. It is composed of different indoor and outdoor scenes organized in categories, such as camouflage, shadows or dynamic background. We evaluate all sequences with the exception of those with camera motion and simulated motion, as we assume the camera is static. We train on the first 50 frames, without checking whether they contain desired detections. The impact of our approach is evaluated in Table 4. In this case, SemanticBGS is not considered because the required precomputed semantic segmentation maps are only provided for CDNet sequences. We can observe that most methods alone already reach good pixel-wise scores and that the a-contrario validation does not decrease them. Furthermore, the object-wise metrics improve in all cases. This demonstrates the efficacy of our approach in scenarios with both indoor and outdoor scenes.
Results on sequences from J. Zhong and S. Sclaroff [17]. Lastly, we evaluated our method on two sequences from J. Zhong and S. Sclaroff [17] containing challenging dynamic backgrounds. One of them shows an escalator moving continuously, while the other displays a floating plastic bottle. In each case, sequences with and without the target objects are provided; we used the latter for training and tested on the former. We evaluated our approach on the same methods as for the LASIESTA sequences. The results, reported in Table 5, show a clear improvement of both pixel-wise and object-wise scores for all methods with the exception of BSUV 2.0, where the pixel-wise F-score drops slightly. The water sequence, seen in the second to last row of Fig. 3, illustrates a particular case where a purely semantic approach does not suffice to model the nature of the changes. In this sequence, the changes of interest are pieces of trash floating away in the water. While the bottle shown could be semantically identified, other pieces of plastic or other materials that constitute trash would fail to be. Hence, a non-semantic approach like ours can better address these challenges by modeling the environment. A similar scenario is the detection of littering in street-level surveillance cameras. Modeling “litter” is difficult because it can correspond to small pieces of garbage, paper, plastic, cans, etc., that are left in public places instead of being properly disposed of. Thus, a purely semantic approach will struggle to clearly discriminate which of the observed elements correspond to littering and which do not.
Fig. 4 Comparison of the histograms of TP and FP computed by the SuBSENSE algorithm for the sequence escalator, by size of the detection (top) and by the \(\log (\textrm{NFA})\) score (bottom). As shown, filtering out small detections is not efficient, while the \(\log (\textrm{NFA})\) provides a more suitable separation
Could the obtained drastic reduction of false alarms be achieved by a simpler method than the a-contrario test? An obvious candidate is to eliminate small detections, since most datasets contain large true positives compared to the size of the predicted false positives. Therefore, we checked whether simply removing all small detections could significantly improve the false alarm rate. To study the capability of our approach to discriminate FPs from TPs regardless of the size of the detections, we analyzed their separability with respect to the detection size and to the proposed \(\log (\textrm{NFA})\) score. Doing so also leads to an optimal value of the threshold \(\epsilon \). Figure 4 compares the separability of FPs and TPs based on the region size against the \(\log (\textrm{NFA})\) score, for the results predicted by SuBSENSE on the sequence escalator. This particular example shows that FPs and TPs are not separable by region size, whereas the a-contrario assessment provides a clear separation. Additional examples for other sequences and methods are provided in Figs. 5, 6, 7 and 8. The value of \(\log (\epsilon )\) is then set to 0 (thus \(\epsilon \)=1), which provides a reasonable separation of FPs and TPs without discarding a high number of true detections. Moreover, the choice \(\epsilon \)=1 is significant in itself, as it corresponds to an expected number of false alarms equal to one.
Computational cost. We provide an analysis of the computational cost of the proposed approach. The code is implemented in JAX in a batch-like manner, where at each iteration a batch of frames is randomly selected from a set of training images. The algorithm stops when convergence or a maximum number of iterations \(max\_iter\) is reached; hence, the length of the sequence does not condition the training time. We set the batch size to 32, the convergence threshold to \(10^{-3}\) and \(max\_iter\) to 100. We noted that various sequences reached the maximum number of iterations \(max\_iter\), which could be decreased in order to accelerate training; another option would be to relax the convergence threshold. Table 6 shows the training times of the CDNet 2012 sequences (part 1) and the 2014 sequences (part 2). The average training time per sequence was 1.63 h. As can be seen, training varies from 20 min to about 3.5 h. The reason is that different sequences show different levels of background complexity. For example, the sequences in dynamicBackground generally have complex dynamic patterns that are more challenging for the GMM to model. On the other hand, GLAD removes unnecessary Gaussians when these are not required, thus reducing the training time.
6 Conclusion
In this work, we introduced a statistical modeling approach of deep features for the reduction of false alarms in video surveillance applications. In Sect. 5, a series of experiments were conducted to measure the impact of the proposed a-contrario validation on several change detection algorithms. The results indicate that a substantial improvement is achieved in object-wise measures in virtually all cases, without decreasing the pixel-wise results. The ability to reduce the number of false alarms by such large margins without hampering detection accuracy is an important step toward real automation of surveillance systems. Moreover, we showed the capability of our approach to discard FPs regardless of their size. To the best of our knowledge, this is the first work that uses statistical modeling in the deep feature space for video surveillance applications.
Limitations and future work. While our work successfully decreases the number of false alarms and improves several algorithms on short- and mid-size sequences, each sequence needs to be trained offline. This compromises the method in long, evolving sequences. Hence, future work will focus on an online version to adapt to such cases. In addition, the training time might not be acceptable for certain applications, thus reducing it would lead to practical improvements as well.
Data Availability
No datasets were generated or analysed during the current study.
References
Wang, Y., Jodoin, P.-M., Porikli, F., Konrad, J., Benezeth, Y., Ishwar, P.: Cdnet 2014: an expanded change detection benchmark dataset. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 393–400 (2014). https://doi.org/10.1109/CVPRW.2014.126
Ishikawa, T., Zin, T.T.: A study on detection of suspicious persons for intelligent monitoring system. In: Big Data Analysis and Deep Learning Applications, pp. 292–301 (2019). https://doi.org/10.1007/978-981-13-0869-7_33
Lyubymenko, K., Adamek, M., Kralik, L.: Detection of suspicious persons and special software. In: 2017 12th Iberian Conference on Information Systems and Technologies (CISTI), pp. 1–4 (2017). https://doi.org/10.23919/CISTI.2017.7975906
Michael, M., Feist, C., Schuller, F., Tschentscher, M.: Fast change detection for camera-based surveillance systems. In: 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 2481–2486 (2016). https://doi.org/10.1109/ITSC.2016.7795955
Grompone, R., Hessel, C., Dagobert, T., Morel, J.-M., Franchis, C.: Ground visibility in satellite optical time series based on a contrario local image matching. Image Process. On Line 11, 212–233 (2021). https://doi.org/10.5201/ipol.2021.342
Ouerghi, E., Ehret, T., Franchis, C., Facciolo, G., Lauvaux, T., Meinhardt, E., Morel, J.-M.: Detection of methane plumes in hyperspectral images from sentinel-5p by coupling anomaly detection and pattern recognition. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences V-3-2021, 81–87 (2021) https://doi.org/10.5194/isprs-annals-V-3-2021-81-2021
Ouerghi, E., Ehret, T., Franchis, C., Facciolo, G., Lauvaux, T., Meinhardt, E., Morel, J.-M.: Automatic methane plumes detection in time series of sentinel-5p l1b images. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences V-3-2022, 147–154 (2022) https://doi.org/10.5194/isprs-annals-V-3-2022-147-2022
Barnich, O., Van Droogenbroeck, M.: Vibe: a universal background subtraction algorithm for video sequences. IEEE Trans. Image Process. 20(6), 1709–1724 (2011). https://doi.org/10.1109/TIP.2010.2101613
St-Charles, P.-L., Bilodeau, G.-A., Bergevin, R.: Subsense: a universal change detection method with local adaptive sensitivity. IEEE Trans. Image Process. 24(1), 359–373 (2015). https://doi.org/10.1109/TIP.2014.2378053
Braham, M., Van Droogenbroeck, M.: Deep background subtraction with scene-specific convolutional neural networks. In: 2016 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4 (2016). https://doi.org/10.1109/IWSSIP.2016.7502717
Babaee, M., Dinh, D.T., Rigoll, G.: A deep convolutional neural network for background subtraction. Arxiv (2017) arXiv: 1702.01731
Sakkos, D., Liu, H., Han, J., Shao, L.: End-to-end video background subtraction with 3d convolutional neural networks. Multimed. Tools Appl. (2018). https://doi.org/10.1007/s11042-017-5460-9
Braham, M., Pierard, S., Van Droogenbroeck, M.: Semantic background subtraction. In: IEEE International Conference on Image Processing (ICIP), pp. 4552–4556 (2017). https://doi.org/10.1109/ICIP.2017.8297144
Noh, H., Ju, J., Seo, M., Park, J., Choi, D.-G.: Unsupervised change detection based on image reconstruction loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1352–1361 (2022). https://doi.org/10.1109/CVPRW56347.2022.00141
Goyette, N., Jodoin, P.-M., Porikli, F., Konrad, J., Ishwar, P.: Changedetection.net: a new change detection benchmark dataset. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1–8 (2012). https://doi.org/10.1109/CVPRW.2012.6238919
Cuevas, C., Yáñez, E.M., García, N.: Labeled dataset for integral evaluation of moving object detection algorithms: Lasiesta. Comput. Vis. Image Underst. 152, 103–117 (2016). https://doi.org/10.1016/j.cviu.2016.08.005
Zhong, J., Sclaroff, S.: Segmenting foreground objects from a dynamic textured background via a robust kalman filter. In: Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV), pp. 44–50 (2003). https://doi.org/10.1109/ICCV.2003.1238312
Sobral, A., Vacavant, A.: A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos. Comput. Vis. Image Underst. 122, 4–21 (2014). https://doi.org/10.1016/j.cviu.2013.12.005
Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 246–252 (1999). https://doi.org/10.1109/CVPR.1999.784637
Zivkovic, Z.: Improved adaptive gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 2, pp. 28–31 (2004). https://doi.org/10.1109/ICPR.2004.1333992
Harville, M.: A framework for high-level feedback to adaptive, per-pixel, mixture-of-gaussian background models. In: European Conference on Computer Vision (ECCV), pp. 543–560 (2002). https://doi.org/10.1007/3-540-47977-5_36
Mittal, A., Huttenlocher, D.: Scene modeling for wide area surveillance and image synthesis. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 160–167 (2000). https://doi.org/10.1109/CVPR.2000.854767
Lim, L.A., Yalim Keles, H.: Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recognit. Lett. 112, 256–262 (2018). https://doi.org/10.1016/j.patrec.2018.08.002
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations (ICLR), pp. 1–14 (2015). arXiv: 1409.1556
Tezcan, M.O., Ishwar, P., Konrad, J.: Bsuv-net: a fully-convolutional neural network for background subtraction of unseen videos. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2763–2772 (2020). https://doi.org/10.1109/WACV45572.2020.9093464
Tezcan, M.O., Ishwar, P., Konrad, J.: Bsuv-net 2.0: spatio-temporal data augmentations for video-agnostic supervised background subtraction. IEEE Access 9, 53849–53860 (2021). https://doi.org/10.1109/ACCESS.2021.3071163
Garcia-Garcia, B., Bouwmans, T., Rosales Silva, A.J.: Background subtraction in real applications: challenges, current models and future directions. Comput. Sci. Rev. 35, 100204 (2020). https://doi.org/10.1016/j.cosrev.2019.100204
Bouwmans, T., Javed, S., Sultana, M., Jung, S.K.: Deep neural network concepts for background subtraction: a systematic review and comparative evaluation. Neural Netw. 117, 8–66 (2019). https://doi.org/10.1016/j.neunet.2019.04.024
Cioppa, A., Droogenbroeck, M.V., Braham, M.: Real-time semantic background subtraction. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 3214–3218 (2020). https://doi.org/10.1109/ICIP40778.2020.9190838
Rezaei, B., Farnoosh, A., Ostadabbas, S.: G-lbm: Generative low-dimensional background model estimation from video sequences. In: European Conference on Computer Vision (ECCV), pp. 293–310 (2020). https://doi.org/10.1007/978-3-030-58610-2_18
An, Y., Zhao, X., Yu, T., Gu, H., Zhao, C., Tang, M., Wang, J.: Zbs: Zero-shot background subtraction via instance-level background modeling and foreground selection. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6355–6364 (2023). https://doi.org/10.1109/CVPR52729.2023.00615
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML), pp. 8748–8763 (2021). arXiv: 2103.00020
Yang, K., Xia, G.-S., Liu, Z., Du, B., Yang, W., Pelillo, M., Zhang, L.: Asymmetric siamese networks for semantic change detection in aerial images. IEEE Trans. Geosci. Remote Sens. 60, 1–18 (2022). https://doi.org/10.1109/TGRS.2021.3113912
Caye Daudt, R., Le Saux, B., Boulch, A., Gousseau, Y.: HRSCD - high resolution semantic change detection dataset. https://doi.org/10.21227/azv7-ta17
Lv, Z., Wang, F., Cui, G., Benediktsson, J.A., Lei, T., Sun, W.: Spatial-spectral attention network guided with change magnitude image for land cover change detection using remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–12 (2022). https://doi.org/10.1109/TGRS.2022.3197901
Wu, C., Du, B., Zhang, L.: Fully convolutional change detection framework with generative adversarial network for unsupervised, weakly supervised and regional supervised change detection. IEEE Trans. Pattern Anal Mach. Intell. 45(8), 9774–9788 (2023). https://doi.org/10.1109/TPAMI.2023.3237896
Daudt, R.C., Le Saux, B., Boulch, A., Gousseau, Y.: Urban change detection for multispectral earth observation using convolutional neural networks. In: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 2115–2118 (2018). https://doi.org/10.1109/IGARSS.2018.8518015
Giordano, D., Kavasidis, I., Palazzo, S., Spampinato, C.: Rejecting false positives in video object segmentation. In: Computer Analysis of Images and Patterns, pp. 100–112 (2015). https://doi.org/10.1007/978-3-319-23192-1_9
Brutzer, S., Höferlin, B., Heidemann, G.: Evaluation of background subtraction techniques for video surveillance. In: CVPR 2011, pp. 1937–1944 (2011). https://doi.org/10.1109/CVPR.2011.5995508
Nguyen-Ngoc Tran, D., Phuoc Nguyen, T., Nhu Do, T., Viet-Uyen Ha, S.: Subsequent processing of background modeling for traffic surveillance system. Int. J. Comput. Theory Eng. 8(3), 235–239 (2016). https://doi.org/10.7763/IJCTE.2016.V8.1050
Wang, L., Zhao, X., Liu, Y.: Reduce false positives for human detection by a priori probability in videos. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 584–588 (2015). https://doi.org/10.1109/ACPR.2015.7486570
Lin, L., Wang, B., Gu, Y.: A post-processing approach in moving objects detection via feature pyramid networks. In: 2018 9th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), pp. 191–195 (2018). https://doi.org/10.1109/PAAP.2018.00040
Ortego, D., Sanmiguel, J.C., Martínez, J.M.: Hierarchical improvement of foreground segmentation masks in background subtraction. IEEE Trans. Circuits Syst. Video Technol. 29, 1645–1658 (2019). https://doi.org/10.1109/TCSVT.2018.2851440
Margolin, R., Zelnik-Manor, L., Tal, A.: How to evaluate foreground maps? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2014)
Rottmann, M., Colling, P., Paul Hack, T., Chan, R., Hüger, F., Schlicht, P., Gottschalk, H.: Prediction error meta classification in semantic segmentation: Detection via aggregated dispersion measures of softmax probabilities. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–9 (2020). https://doi.org/10.1109/IJCNN48605.2020.9206659
Chan, R., Lis, K., Uhlemeyer, S., Blum, H., Honari, S., Siegwart, R., Fua, P., Salzmann, M., Rottmann, M.: Segmentmeifyoucan: a benchmark for anomaly segmentation. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, vol. 1 (2021). https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/d67d8ab4f4c10bf22aa353e27879133c-Paper-round2.pdf
Saleemi, I., Hartung, L., Shah, M.: Scene understanding by statistical modeling of motion patterns. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2069–2076 (2010). https://doi.org/10.1109/CVPR.2010.5539884
Ghahremannezhad, H., Shi, H., Liu, C.: Real-time hysteresis foreground detection in video captured by moving cameras. In: 2022 IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1–6 (2022). https://doi.org/10.1109/IST55454.2022.9827719
Defard, T., Setkov, A., Loesch, A., Audigier, R.: Padim: a patch distribution modeling framework for anomaly detection and localization. In: International Conference on Pattern Recognition (ICPR), pp. 475–489 (2021). https://doi.org/10.1007/978-3-030-68799-1_35
Artola, A., Kolodziej, Y., Morel, J.-M., Ehret, T.: Glad: a global-to-local anomaly detector. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5490–5499 (2023). https://doi.org/10.1109/WACV56688.2023.00546
Desolneux, A., Moisan, L., Morel, J.-M.: Meaningful alignments. Int. J. Comput. Vis. 40, 7–23 (2000). https://doi.org/10.1023/A:1026593302236
Desolneux, A., Moisan, L., Morel, J.-M.: From Gestalt Theory to Image Analysis: A Probabilistic Approach, vol. 34. Springer, Paris (2008). https://doi.org/10.1007/978-0-387-74378-3
Gordon, A., Glazko, G., Qiu, X., Yakovlev, A.: Control of the mean number of false discoveries, Bonferroni and stability of multiple testing. Ann. Appl. Stat. (2007). https://doi.org/10.1214/07-AOAS102
Lisani, J.-L., Ramis, S.: A contrario detection of faces with a short cascade of classifiers. Image Process. On Line 9, 269–290 (2019). https://doi.org/10.5201/ipol.2019.272
Robin, A., Moisan, L., Le Hégarat-Mascle, S.: An a-contrario approach for subpixel change detection in satellite imagery. IEEE Trans. Pattern Anal. Mach. Intell. 32(11), 1977–1993 (2010). https://doi.org/10.1109/TPAMI.2010.37
Robin, A., Mercier, G., Moser, G., Serpico, S.: An a-contrario approach for unsupervised change detection in radar images. In: 2009 IEEE International Geoscience and Remote Sensing Symposium, vol. 4, pp. 240–243 (2009). https://doi.org/10.1109/IGARSS.2009.5417327
Tailanian, M., Pardo, Á., Musé, P.: U-flow: a u-shaped normalizing flow for anomaly detection with unsupervised threshold. J. Math. Imaging Vis. (2024). https://doi.org/10.1007/s10851-024-01193-y
Ciocarlan, A., Le Hégarat-Mascle, S., Lefebvre, S., Woiselle, A.: Deep-nfa: a deep a contrario framework for tiny object detection. Pattern Recognit. 150, 110312 (2024). https://doi.org/10.1016/j.patcog.2024.110312
Ciocarlan, A., Le Hégarat-Mascle, S., Lefebvre, S., Woiselle, A., Barbanson, C.: A contrario paradigm for yolo-based infrared small target detection. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5630–5634 (2024). https://doi.org/10.1109/ICASSP48485.2024.10446505
Klein, J.P., Moeschberger, M.L.: Survival Analysis: Techniques for Censored and Truncated Data, vol. 2. Springer, New York (2003). https://doi.org/10.1007/b97377
Golomb, S.W.: Polyominoes: Puzzles, Patterns, Problems, and Packings - Revised and Expanded, 2nd edn. Princeton University Press, Princeton (2020). https://doi.org/10.1515/9780691215051
Jensen, I., Guttmann, A.J.: Statistics of lattice animals (polyominoes) and polygons. J. Phys. A: Math. General 33(29), 257 (2000). https://doi.org/10.1088/0305-4470/33/29/102
Liu, C., Abergel, R., Gousseau, Y., Tupin, F.: A line segment detector for sar images with controlled false alarm rate. In: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 8464–8467 (2018). https://doi.org/10.1109/IGARSS.2018.8518258
Myaskouvskey, A., Gousseau, Y., Lindenbaum, M.: Beyond independence: an extension of the a contrario decision procedure. Int. J. Comput. Vis. 101, 22–44 (2013)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
Bardes, A., Ponce, J., Lecun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. In: International Conference on Learning Representations (ICLR) (2022). https://inria.hal.science/hal-03541297
Piérard, S., Droogenbroeck, M.V.: Summarizing the performances of a background subtraction algorithm measured on several videos. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 3234–3238 (2020). https://doi.org/10.1109/ICIP40778.2020.9190865
Acknowledgements
This work was funded by AID-DGA (l’Agence de l’Innovation de Défense à la Direction Générale de l’Armement, Ministère des Armées), and was performed using HPC resources from GENCI-IDRIS (grants 2023-AD011011801R3, 2023-AD011012453R2, 2023-AD011012458R2) and from the “Mésocentre” computing center of CentraleSupélec and ENS Paris-Saclay, supported by CNRS and Région Île-de-France (http://mesocentre.universite-paris-saclay.fr). Centre Borelli is also affiliated with Université Paris Cité, SSA and INSERM.
Funding
Open access funding provided by Université Paris-Saclay.
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and experiments were performed by XB. The first draft of the manuscript was written by XB and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Quantitative Results
This section provides the complete results of our experiments, reporting the raw values of all metrics defined in Sect. 3.
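As a reading aid for the tables that follow, the sketch below illustrates the two evaluation granularities at a toy level: pixel-wise scores weigh every pixel equally, whereas component-wise (object-wise) scores count each connected component once. The exact definitions are those of Sect. 3; the greedy one-to-one matching, the 0.5 IoU threshold and the function names below are illustrative assumptions only.

```python
# Toy sketch of the two evaluation levels reported in the appendix tables.
# Assumptions (not necessarily the exact Sect. 3 definitions): greedy
# one-to-one matching of connected components with IoU >= 0.5.
import numpy as np
from scipy import ndimage

def pixel_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-wise F1: every pixel contributes equally."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # two empty masks agree perfectly

def object_f1(pred: np.ndarray, gt: np.ndarray, iou_thr: float = 0.5) -> float:
    """Component-wise F1: each connected component counts once, so a few
    small false blobs can hurt as much as one large one."""
    pred_lab, n_pred = ndimage.label(pred.astype(bool))
    gt_lab, n_gt = ndimage.label(gt.astype(bool))
    matched, tp = set(), 0
    for i in range(1, n_gt + 1):
        g = gt_lab == i
        for j in range(1, n_pred + 1):
            if j in matched:
                continue
            p = pred_lab == j
            iou = np.logical_and(g, p).sum() / np.logical_or(g, p).sum()
            if iou >= iou_thr:
                matched.add(j)
                tp += 1
                break
    fp, fn = n_pred - tp, n_gt - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```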
1.1 CDNet Dataset
Table 7 shows the overall results obtained for each evaluation metric, before and after the a-contrario validation, considering all sequences and all categories of the CDNet benchmark. Note that, since we assume the camera is static, the cameraJitter and PTZ categories have not been considered. The results for each category are provided separately and include the ordinary IoU scores for completeness. As is common practice, we report the average of each metric over the set of videos in each category. Nevertheless, we would like to point out that this is theoretically incorrect: arithmetically averaging these indicators does not yield a value that preserves their probabilistic meaning, as described by Piérard and Van Droogenbroeck [68]; a toy illustration is given after the list below. Each category is linked to its corresponding table in the following list:
- baseline: Table 8
- dynamicBackground: Table 9
- badWeather: Table 10
- intermittentObjectMotion: Table 11
- lowFramerate: Table 12
- nightVideos: Table 13
- thermal: Table 14
- shadow: Table 15
- turbulence: Table 16
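To make the caveat about averaging concrete, the following toy computation (with made-up TP/FP/FN counts, not values from our experiments) shows how the arithmetic mean of per-video \(F_1\) scores can deviate strongly from the \(F_1\) of the pooled counts, since sequences with little ground truth weigh as much as dense ones:

```python
# Toy illustration of why arithmetically averaging per-video scores is not
# probabilistically sound [68]. All counts below are made up.
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

videos = [            # (TP, FP, FN) per video
    (900, 50, 50),    # dense, easy sequence: F1 ~ 0.947
    (10, 40, 40),     # sparse, hard sequence: F1 = 0.200
]

mean_of_f1 = sum(f1(*v) for v in videos) / len(videos)
pooled = tuple(map(sum, zip(*videos)))
print(f"mean of per-video F1: {mean_of_f1:.3f}")  # 0.574
print(f"F1 of pooled counts:  {f1(*pooled):.3f}")  # 0.910
```

Both summaries are legitimate but answer different questions; we follow the common practice of reporting the arithmetic mean.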
1.2 LASIESTA Dataset
The full quantitative results for both pixel and object metrics are provided in Table 17.
1.3 Sequences from Zhong and Sclaroff
Table 18 shows the full quantitative results for both pixel and object metrics on the sequences from Zhong and Sclaroff.
1.4 Examples on a Different Image Size
We report results for the CDNet 2012 sequences at an image size of 128×128 to show the effectiveness of our approach across image sizes. To this end, Tables 19, 20, 21, 22 and 23 show the results for baseline, dynamicBackground, badWeather, intermittentObjectMotion and lowFramerate, respectively.
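As a minimal reproduction aid, the snippet below shows one possible preprocessing step for this experiment; the 128×128 resolution comes from our setup, while OpenCV and the interpolation choices are assumptions of this sketch (nearest-neighbor interpolation keeps the ground-truth masks binary).

```python
# Hypothetical resizing step used before rerunning detection and
# evaluation at 128x128. cv2 and the interpolation modes are this
# sketch's assumptions, not part of the original pipeline description.
import cv2

def resize_pair(frame, gt_mask, size=(128, 128)):
    frame_s = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
    mask_s = cv2.resize(gt_mask, size, interpolation=cv2.INTER_NEAREST)
    return frame_s, mask_s
```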
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Bou, X., Artola, A., Ehret, T. et al. Statistical Modeling of Deep Features to Reduce False Alarms in Video Change Detection. J Math Imaging Vis 67, 19 (2025). https://doi.org/10.1007/s10851-025-01238-w