Statistical Modeling of Deep Features to Reduce False Alarms in Video Change Detection

Bou, Xavier; Artola, Aitor; Ehret, Thibaud; Facciolo, Gabriele; Morel, Jean-Michel; Gioi, Rafael Grompone von

doi:10.1007/s10851-025-01238-w

Statistical Modeling of Deep Features to Reduce False Alarms in Video Change Detection

Open access
Published: 29 March 2025

Volume 67, article number 19, (2025)
Cite this article

Download PDF

You have full access to this open access article

Journal of Mathematical Imaging and Vision Aims and scope Submit manuscript

Statistical Modeling of Deep Features to Reduce False Alarms in Video Change Detection

Download PDF

Xavier Bou¹,
Aitor Artola¹,
Thibaud Ehret¹,
Gabriele Facciolo¹,
Jean-Michel Morel² &
…
Rafael Grompone von Gioi¹

232 Accesses
Explore all metrics

Abstract

Detecting relevant changes is a fundamental problem of video surveillance. Because of the high variability of data and the difficulty of properly annotating changes, unsupervised methods dominate the field. Arguably one of the most critical issues to make them practical is to reduce their false alarm rate. In this work, we develop a non-semantic, method-agnostic, weakly supervised a-contrario validation process, based on high-dimensional statistical modeling of deep features using a Gaussian mixture model, that can reduce the number of false alarms of any change detection algorithm. We also raise the insufficiency of the conventionally used pixel-wise evaluation, as it fails to precisely capture the performance needs of most real applications. For this reason, we complement pixel-wise metrics with component-wise metrics and evaluate the impact of our approach at both pixel and object levels, on six methods and several sequences from different datasets. Our experimental results reveal that the a-contrario theory can be applied to a statistical model of the background of a scene and largely reduce the number of false positives at both pixel and component levels.

Real-Time Permanent Change Proposals for Abandoned Object Detection

Detecting Significant Changes in Image Sequences

A foveated vision framework for visual change detection using motion and textural features

Article 03 January 2021

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Video change detection is a fundamental problem in computer vision and the first step of many applications. While it is an easy task for humans in many contexts, it turns out to be very difficult to automate due to the wide range of possible scenarios. The goal is to assign a change label to those pixels whose photometric properties deviate from those of the background of the scene [1], providing a segmentation map of the temporal anomalies observed at each frame.

In the domains of security and surveillance, change detection can be used for spotting temporal anomalies such as suspicious individuals or stolen objects [2, 3]. In urban scenarios, it can be exploited to analyze common activities such as monitoring illegal parking of vehicles [4]. Change detection may also serve climate and humanitarian causes. Satellite image time series can be used to monitor urban development [5] of specific regions and the variability of gas concentrations in the atmosphere across time [6, 7].

Traditional change detection methods work by learning a statistical model of a scene under normal conditions. This so-called background model is based on past samples [8, 9]. When a new frame is provided, it is compared to the background reference model, which can lead to a detection and/or to an update of this background model. Traditional pixel-wise methods are convenient because only a short local training is required, i.e., a statistical model of the scene is quickly built using a limited set of recent frames, keeping the computational complexity low. Nevertheless, these techniques are limited by the locality of the features, such as RGB pixels, which makes them prone to false alarms. To overcome this problem, deep learning has been used to leverage the ability of deep neural networks (DNNs) to learn suitable, high-level descriptors of a scene. Despite having shown improvements over traditional video change detection methods [10,11,12], such approaches are constrained by their supervised nature and experience a substantial decrease of performance when tested on out of distribution data. Recently, some works have focused on exploiting the semantic information provided by DNNs in an unsupervised manner [13, 14]. These methods are more robust than classic approaches and do not require labeled examples. However, they are still sensitive to false positives in complex environments such as dynamic backgrounds or adverse weather conditions.

Decreasing the number of false positives in unsupervised methods is a high priority goal. Indeed, a significant number of false alarms may saturate a detection system or require human intervention, which is expensive and time-consuming. Change detection methods are conventionally evaluated on pixel-wise metrics, regardless of the spatial organization of faulty pixels: multiple small false detections are counted on par with a single false detection with equivalent area. As a result, pixel-wise scores may not realistically represent the performance of target applications. The number of false alarms is better evaluated at the object level than at the pixel level, because the cost of a false alarm is generally independent of its size. Hence, we shall favor object-wise performance metrics, where by object we understand a connected component.

In this work, we present a weakly supervised a-contrario validation process, based on high-dimensional modeling of deep features, to largely reduce the number of false positives at both the pixel and component levels. It is non-semantic, thus not limited by the known classes of a pretrained neural network. The contributions of our work are as follows:

We propose a method-agnostic, non-semantic, weakly supervised a-contrario validation process that can significantly reduce false alarms in video change detection. To the best of our knowledge, this is the first work to use the a-contrario framework at the DNN feature level for a detection problem.
We evaluate our work on six methods at both pixel and object levels. Furthermore, we test them on a set of sequences from different datasets, namely CDNet [1, 15], LASIESTA [16] and J. Zhong and S. Sclaroff [17].
Our results show a considerable increase in object-wise performance metrics, while also improving or maintaining the pixel-wise results. Figure 1 illustrates this improvement.

2 Related Works

The detection of temporal anomalies in a video sequence or an image time series is known as change detection. It is difficult to find a definition of temporal anomalies that suits all cases. The methods in the literature look for changes with respect to previously observed examples that are semantically meaningful for the desired downstream task. Change detection algorithms in the literature can be categorized into traditional and neural network-based methods.

Traditional change detection. Traditional change detection methods use statistical computer vision techniques to model the background of a scene and update it online [18]. These approaches commonly follow a three-step workflow consisting in 1) building a background model of the scene, 2) comparing the new observed frames to the background model and 3) updating the model accordingly. Background modeling consists in building a faithful probabilistic representation of the past, which is used as a reference for further observed examples. The seminal example of this approach is the adaptive Gaussian mixture model (GMM), first introduced by Stauffer and Grimson [19], which modeled each pixel with a mixture of K Gaussian functions. Several modifications of their method have been proposed [20,21,22] to improve performance and efficiency. Later methods proposed to model the background using a buffer of past samples, which can alleviate computational complexity. New samples are compared against the stored examples based on a consensus. The popular unsupervised methods ViBe [8] and SuBSENSE [9] use such consensus-based algorithms. During inference, a new unseen frame is compared to the generated background model using an error metric that maps pixels to either background or foreground clusters, producing a binary mask with this two-level information.

Recent work has shown that convolutional neural networks (CNNs) have great ability to deal with classification problems in pattern recognition field. Moving objects detection, regarding as a classification process, labels every pixel as a foreground pixel or a background pixel. In this paper, we proposed an effective post-processing approach, residual background networks (ResBGNets), to improve the accuracy of moving objects detection in video sequences. Instead of learning the ground truth directly, our model learns the residual pictures between the results of existing methods and the ground truth. It benefits to understand the hidden character of each algorithm and correct the misclassification. Inside ResBGNets, we build feature pyramid networks (FPN) to combine spatial information of the low-resolution level with semantical features of high-level of the high-resolution level. Evaluation performed on the 2014 CDnet dataset reveals that through our approach, most of the existing background subtraction methods can get better detection results and a significant higher FM score.

Deep learning-based change detection methods. More recent works have exploited current deep learning algorithms to replace one or more steps of the traditional flow. Braham and Van Droogenbroeck [10] show that the complex background modeling task can be simplified by training a CNN with scene-specific examples. An autoencoder-based architecture called FgSegNet is proposed in [23], which adapts a VGG-16 [24] architecture into a triplet framework, processing images at three different scales. Tezcan et al. [25] proposed BSUV-Net, which trains a scene agnostic network so that it can be tested on new, unseen scenes without individually fine-tuning the network. A newer version of their approach, BSUV-Net 2.0 [26], was later proposed. The ability of DNNs to learn suitable, high-level descriptors of a scene has proved to yield better results than traditional approaches [10,11,12]. Nevertheless, supervised methods require large amounts of annotated data, a tedious and time-consuming task. Furthermore, the performance of supervised methods often declines on out of domain examples. Consequently, unsupervised methods are often chosen over recent supervised methods [27, 28]. For this reason, several recent works have focused on leveraging DNN high-level representations without supervision. Braham et al. [13] proposed SemanticBGS, where a classic method is complemented with semantic information provided by a pretrained network. Moreover, a real-time version of the same approach, named RT-SemanticBGS [29], was later introduced. G-LBM, introduced by Rezaei et al. [30], models the background of a scene with a generative adversarial network (GAN) in the presence of noise and sparse outliers. More recently, An et al. [31] introduced zero-shot background subtraction (ZBS), a method that leverages recent advances in zero-shot object detection to build an open-vocabulary instance-level background model via CLIP [32] embeddings.

Unsupervised DNN-based methods achieve better performances than traditional approaches. Nonetheless, they still fall behind supervised methods in popular benchmarks. Similarly to classic methods, these techniques may detect a substantial number of false alarms, which can critically saturate detection systems.

The remote sensing community has a particular interest in the change detection problem. In this case, a sequence of satellite images from the same scene is provided with the objective to discriminate relevant changes over time. The definition of change will depend on the target application, as different datasets and methods focus on land-cover changes or semantic changes [33,34,35,36] while others provide no presupposition of the nature of the changes beforehand [37]. This distinction is often made due to the fact that in high-resolution images we can easily identify the semantics of images, while in low-resolution images the observed changes often consist of a few pixels with an unclear semantic meaning. In remote sensing, the amount of available images is limited, and the revisit time of satellites leads to high temporal difference between acquisition (often days, or even weeks). For this reason, change detection in remote sensing is generally approached from a very extreme case of low frame rate, short sequences.

Post-processing techniques to reduce false positives have been thoroughly discussed in the literature of video background subtraction and change detection. Identifying and removing such cases is considered necessary in order to reduce the visual noise introduced by unavoidable factors such as background movements, light changes and artifacts [38]. Initial approaches relied on local features to correct the results of any classical approach [38,39,40], e.g., shape regularity, color, motion differences, texture, etc. Wang et al. [41] proposed to train a linear classifier on a set of hand-crafted features to discriminate false positives from true positives. Lin et al. [42] later introduced residual background networks (ResBGNets), improving the results of existing methods by learning the residual pictures between the results of existing methods and the ground truth using a CNN. While removing wrongly detected areas is considered a crucial process to avoid system saturation, some works have also focused on additionally recovering undetected segments, i.e., extending segmented foreground pixels to undetected foreground object areas [43]. In addition, some works have focused on how to evaluate the provided segmentation binary maps, and developed alternative metrics to reliably address the quality of these predictions[44,45,46].

Statistical modeling for video scene understanding. Surveillance applications need to distinguish foreground from background elements so that target instances can be further processed. Early methods attempted to statistically model appearance information with parametric models such as GMMs [19,20,21,22]. However, background modeling in complex scenarios (e.g., dynamic backgrounds or adverse weather conditions) has proved to be challenging for those approaches. Other works introduce optical flow to understand the motion patterns of a scene. For example, Saleemi et al. [47] proposed to model the motion patterns with a mixture of Gaussians using optical flow. Similarly, Ghahremannezhad et al. [48] proposed a method for real-time foreground segmentation modeling optical flow with a GMM. Some recent works have proposed to model the feature space of deep neural networks for image anomaly detection. PaDiM, proposed by Defard et al. [49], is a framework that models patches of a DNN feature map with a Gaussian model for anomaly detection and localization. Artola et al. [50] generalized this attempt with GLAD, a method that learns a robust GMM globally, and then localizes the learned Gaussians with a spatial weight map. Modeling spatial features in a high-dimensional space has shown promising results at recognizing complex patterns.

A-contrario detection theory. The a-contrario detection theory is a mathematical formulation of the non-accidentalness principle, which states that an observed structure is meaningful only when the relation between its parts is too regular to be the result of an accidental arrangement of independent parts [51, 52]. The a-contrario methodology [51, 52] allows one to control the number of false alarms by considering an observed structure only when the expectation of its occurrences is small in a stochastic background model. The Number of false alarms (NFA) of an event e observed up to a probability z(e) in the background model $\mathcal {H}_0$ is defined by

$$\begin{aligned} \textrm{NFA}(e) = N_T \cdot \mathbbm {P}[Z_{\mathcal {H}_0}(e) \ge z(e)], \end{aligned}$$

(1)

where $\mathbbm {P}[Z_{\mathcal {H}_0}(e) \ge z(e)]$ is the probability of obtaining a precision $Z_{\mathcal {H}_0}(e)$ better or equal to the observed one z(e) in the background model $\mathcal {H}_0$. The term $N_T$ corresponds to the number of tests, following the statistical multiple hypothesis testing framework [53]. A small NFA indicates that the event e is unlikely to be randomly observed in the background model $\mathcal {H}_0$. Hence, the lower the NFA the more meaningful the event. A value $\epsilon $ is specified and candidates with $\textrm{NFA} < \epsilon $ are accepted as valid detections. It can be shown [51] that in these conditions $\epsilon $ is an upper-bound to the expected number of false detections under $\mathcal {H}_0$.

A-contrario methods have been previously proposed in the literature to address computer vision problems. Lisani and Ramis developed a method in [54] that applied an a-contrario methodology on a normal distribution for the detection of faces in images. In surveillance, a-contrario methods have been used mainly in the remote sensing field, [5, 55, 56], where the temporal difference between images is large, and no tracking of temporal objects is feasible. Grompone et al. [5] proposed an a-contrario method based on a uniform distribution and a greedy algorithm to compute candidate regions, detecting visible ground areas in satellite imagery. Tailanian et al. [57] proposed to control the number of false alarms in image anomaly detection by applying an a-contrario strategy on anomaly maps generated by a multi-scale transformer architecture. Recently, Ciocarlan et al. [58, 59] introduced an a-contrario criterion in the neural network training loop, considering the unexpectedness of an object during training based on the NFA. While these works shed some light on how to integrate a-contrario theory with neural networks, they only detect a particular object class and certain assumptions are taken, i.e., that the background model distribution and the independence of tests.

3 Pixel and Object-wise Evaluation

Change detection algorithms tend to be evaluated on pixel-wise metrics. While this approach allows one to assess how well methods classify pixels into foreground or background clusters, it often fails to represent the performance needs of real applications. Detection systems today tend to consider detections as sets of connected components instead of independent pixels. Then, they process each detection separately for further analysis. An algorithm with high pixel-wise evaluation scores might still predict a considerable amount of false alarms at the object level, which can lead to bottlenecks in the system. Focusing on the performance at the object level will provide a more accurate account of the usability of methods for surveillance applications. Hence, reducing false positives at the object level is one of the most important elements to minimize in order to increase speed and avoid system saturation.

Consequently, we consider both pixel-wise and object-wise evaluation metrics to analyze the performance of our work and existing algorithms. We emphasize our evaluation on the reduction of false alarms and the accuracy of the detections.

Pixel-wise metrics. Let tn, tp, fn, fp be the usual pixel-wise number of true negatives, true positives, false negatives and false positives, respectively. Our experiments consider the following pixel-wise metrics:

Precision: $\textsc {pr} ^{pi} = \textsc {tp}/ (\textsc {tp} + \textsc {fp})$
Recall: $\textsc {re} ^{pi} = \textsc {tp}/ (\textsc {tp} +\textsc {fn})$
False Positive Rate: $\textsc {fpr} ^{pi} = \textsc {fp}/(\textsc {fp} +\textsc {tn})$
Percentage of Wrong Classifications: $\textsc {pwc} ^{pi} = 100 \times (\textsc {fn} +\textsc {fp}) / (\textsc {tp} +\textsc {fn} +\textsc {fp} +\textsc {tn})$
F-measure: f $_1$ $^{pi}=2 (\textsc {pr} ^{pi} \times \textsc {re} ^{pi}) / (\textsc {pr} ^{pi} + \textsc {re} ^{pi})$.

Object-wise metrics. We also evaluate the results at the object level, where $\textsc {tp} ^{ob}$, $\textsc {fn} ^{ob}$ and $\textsc {fp} ^{ob}$ now correspond to true positives, false negatives and false positives for sets of connected components. To define our metrics, we consider two perspectives. Firstly, we evaluate the spatial alignment between predictions and ground truth using the traditional intersection over union (IoU). However, such measurement might consider as faulty detections those that do not spatially align with the ground truth to a degree. While this can be considered valid for most cases, sometimes the nature of the target application might not allow us to miss any detections, even if the precision of the blob is quite dissimilar from the ground truth. Hence, we propose two sets of metrics to evaluate our results on the object level.

In one hand, we take the approach of Chan et al. [46] and use a variation of the traditional intersection over union (IoU) first introduced by Rottman et al. [45]. Unlike the conventional IoU, which penalizes cases where a ground truth region is fragmented into multiple predictions by assigning each prediction a moderate IoU score, the adapted metric, named sIoU, does not penalize predictions of a segment when the remaining ground truth is sufficiently covered by other predicted segments.

More formally, let $\mathcal {K}$ be the set of anomalous components in the ground truth, and $\hat{\mathcal {K}}$ the set of anomalous components predicted by a change detection algorithm. Hence, the sIoU metric consists in a mapping $sIoU: \mathcal {K} \rightarrow [0, 1]$ defined for $k \in \mathcal {K}$ by

$$\begin{aligned} \begin{aligned} sIoU(k):= \frac{| k\cap \hat{\mathcal {K}}(k) |}{| (k\cup \hat{\mathcal {K}}(k) )\backslash \mathcal {A}(k) |} \\ \text {with} \quad \hat{\mathcal {K}}(k) = \bigcup _{\hat{k}\in \hat{\mathcal {K}},\,\, \hat{k}\cap k \ne \varnothing } \hat{k}, \end{aligned} \end{aligned}$$

(2)

where $\mathcal {A}(k) = \{z\in k^{\prime }: k^{\prime } \in \mathcal {K} \backslash \{k\}\}$. The introduction of $\mathcal {A}(k)$ excludes all pixels from the union if and only if they correctly intersect with another ground truth component. Hence, given a threshold $\tau \in [0, 1)$, we define a target $k\in \mathcal {K}$ as $\textsc {tp} ^{ob}$ if $sIoU > \tau $, and as $\textsc {fn} ^{ob}$ otherwise. Then, $\textsc {fp} ^{ob}$ is computed as the positive predictive value (PPV) for $\hat{k}\in \hat{\mathcal {K}}$, defined as:

$$\begin{aligned} PPV(\hat{k}):= \frac{|\hat{k} \cap \mathcal {K}(\hat{k})|}{|\hat{k}|}. \end{aligned}$$

(3)

Thus, $\hat{k}\in \hat{\mathcal {K}}$ is $\textsc {fp} ^{ob}$ if $PPV(\hat{k}) \le \tau $. Lastly, the sIoU-based F-measure is computed as follows:

F-measure: f $_1$ $^{sIoU}(\tau )=\frac{2\cdot \textsc {tp} ^{ob}(\tau )}{2\cdot \textsc {tp} ^{ob}(\tau )+\textsc {fn} ^{ob}(\tau )+\textsc {fp} ^{ob}(\tau )}$

We follow the approach of Chan et al. and average the results for different thresholds $\tau = \{0.25, 0.5, 0.75\}$.

On the other hand, we adapt the pixel-level metrics defined at the beginning of the section to the object level, where any detection which overlaps by at least one pixel with the ground truth will be considered as a good detection. While we previously defined $\textsc {tp} ^{ob}$, $\textsc {fn} ^{ob}$ and $\textsc {fp} ^{ob}$, true negatives at the component level lack clear meaning. Alternatively, we propose to compute the fpr relative to the number of frames $n_f$. Instead of selecting $\textsc {tp} ^{ob}$, $\textsc {fn} ^{ob}$ and $\textsc {fp} ^{ob}$ by IoU thresholding, we define $\textsc {tp} ^{ob}$ as detected regions containing at least one real positive pixel. Then, $\textsc {fp} ^{ob}$ are detections that do not overlap with any real positive pixels and $\textsc {fn} ^{ob}$ are regions that should have been detected but no change was predicted by the method. To avoid the fragmentation issue, we carefully mark the entire ground truth region for each true positive detection as already checked. Thus, we define the object-wise metrics as follows:

$\textsc {pr} ^{ob} = \textsc {tp} ^{ob} / (\textsc {tp} ^{ob} + \textsc {fp} ^{ob})$
$\textsc {re} ^{ob} = \textsc {tp} ^{ob} / (\textsc {tp} ^{ob}+\textsc {fn} ^{ob})$
$\textsc {fpr} ^{ob} = \textsc {fp} ^{ob}/n_f$
$\textsc {pwc} ^{ob} = 100 \times (\textsc {fn} ^{ob}+\textsc {fp} ^{ob}) / (\textsc {tp} ^{ob}+\textsc {fn} ^{ob}+\textsc {fp} ^{ob})$
f $_1$ $^{ob} = 2 (\textsc {pr} ^{ob} \times \textsc {re} ^{ob}) / (\textsc {pr} ^{ob} + \textsc {re} ^{ob})$.

For visualization purposes and since we prioritize the reduction of false positives, we provide the following pixel and object metrics in the main text tables: $\textsc {fpr} ^{pi}$, $\textsc {pwc} ^{pi}$, f $_1$ $^{pi}$, sIoU, f $_1$ $^{sIoU}$ and $\textsc {fpr} ^{ob}$. In addition, providing all defined metrics would be redundant, but they were computed for completeness and their raw values are provided in the appendix.

4 Our Approach

We propose to supplement any change detection method from the literature with a final a-contrario validation step applied on connected components. Such validation is based on a statistical model of the DNN representations of the scene and requires no annotation. The feature representations can be obtained by a pretrained CNN used as backbone (see the experimental Sect. 5). Hence, we first extract such deep representations of our scene at one or more stages of the network. Furthermore, we build a background model of the scene by learning a Gaussian Mixture in a global-to-local manner. Lastly, we develop an a-contrario validation process to control the number of false alarms at both the pixel and object levels using the learned model and a preexistent change detection algorithm. In the following sections, we present in detail the general detection framework, which is independent of the choice of the pretrained backbone. Figure 2 shows a high-level diagram of the proposed approach.

4.1 Background Feature Modeling

We model the extracted deep representations with a mixture of Gaussians to assess the likelihood of an image patch to be a part of the background. For that we extend the GLAD [50] framework to the background modeling of videos. This requires no dense annotations, only a selection of training frames with none or few anomalies present, thus we label our approach as weakly supervised. A mixture is first learned globally, i.e., without taking into consideration the spatial location of the data points. This yields a first Gaussian mixture model $\theta = (\phi _i, \mu _i, \Sigma _i)_{i \in \{1,\dots , K\}}$, where $\mu _i$ and $\Sigma _i$ are the mean and variance components, while $\phi _i$ are the mixture weights. Then, a local model is derived by assigning position-dependent weights for each Gaussian, so that an image position is represented by a local mixture of the most relevant Gaussian distributions. This gives a localized model that depends on the pixel position (x, y) such that $\theta (x,y) = (\phi _i(x,y), \mu _i, \Sigma _i)_{i \in \{1,\dots , K\}}$, where $\mu _i$ and $\Sigma _i$ do not depend on the position (x, y). This global-to-local approach enables one to exploit information from other similar pixels and to build a good representation of each observed pixel.

Definition 1

The probability of observing ${\textbf {p}}$ at position (x, y) is

$$\begin{aligned} \mathbb {P}\big ({\textbf {p}}~|~\theta (x,y)\big ) = \sum _{i=1}^K \phi _i(x,y) \mathbb {P}({\textbf {p}}~|~\mu _i, \Sigma _i), \end{aligned}$$

(4)

where only the weights $\phi _i(x,y)$ depend on the position. Then, the $p{-}\textrm{value}$ of a given pixel ${\textbf {p}}$ at position (x, y) is defined by

$$\begin{aligned} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big ) = \hspace{-12mm} \int \limits _{\hspace{12mm}\mathcal {D}({\textbf {p}}|\theta (x,y))} \hspace{-10mm} \mathbb {P}\big ({\textbf {q}}~|~\theta (x,y)\big ) \, d{\textbf {q}}, \end{aligned}$$

(5)

where $\mathcal {D}({\textbf {p}}|\theta (x,y)) =\big \{{\textbf {q}}~|~\mathbb {P}({\textbf {q}}|\theta (x,y)) \le \mathbb {P}({\textbf {p}}|\theta (x,y))\big \}$.

This quantity cannot be easily computed, but an upper-bound can be derived. We can invert the sum of the GMM and the integral of the p value to get K Gaussian integrals. However, the set $\mathcal {D}$ cannot be computed, so we introduce $\mathcal {D}_i({\textbf {p}}|\theta (x,y)) =\big \{{\textbf {q}}~|~ \phi _i(x,y) \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \le \mathbb {P}({\textbf {p}}|\theta (x,y)) \big \}$ which contains it ($\mathcal {D}\subseteq \mathcal {D}_i$). We thus find ourselves with the upper-bound

$$\begin{aligned} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big ) \le \sum _{i=1}^K \hspace{-13mm} \int \limits _{\hspace{13mm} \mathcal {D}_i({\textbf {p}}|\theta (x,y))} \hspace{-10mm} \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \, d{\textbf {q}}. \end{aligned}$$

(6)

These integrals are equivalent to the $\chi ^2$ survival function [60]; in the case where the features are of even dimension they are equal to a finite sum that can be computed exactly, as stated in the next proposition.

Proposition 1

Consider a mixture of classical Gaussian distributions

$$\begin{aligned} \mathbb {P}\big ({\textbf {p}}~|~\theta \big ) = \sum _{i=1}^K \phi _i \mathbb {P}({\textbf {p}}~|~\mu _i, \Sigma _i). \end{aligned}$$

(7)

and define the p value as the integral of the density where the probability is lower than the probability density $\mathbb {P}({\textbf {p}}|\theta )$,

$$\begin{aligned} \begin{array}{r c l} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta \big ) & = & \hspace{-10mm} \int \limits _{\hspace{12mm}\mathcal {D}({\textbf {p}}|\theta )} \hspace{-7mm} \mathbb {P}\big ({\textbf {q}}~|~\theta \big ) \, d{\textbf {q}}\\ & = & \sum _{i=1}^K \phi _i \hspace{-6mm} \int \limits _{\hspace{6mm} \mathcal {D}({\textbf {p}}|\theta )} \hspace{-6mm} \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \, d{\textbf {q}}, \end{array} \end{aligned}$$

(8)

where $\mathcal {D}({\textbf {p}}|\theta ) =\big \{{\textbf {q}}~|~\mathbb {P}({\textbf {q}}|\theta )\le \mathbb {P}({\textbf {p}}|\theta )\big \}$. Set $\mathcal {D}_i({\textbf {p}}|\theta ) =\big \{{\textbf {q}}~|~ \phi _i \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \le \mathbb {P}({\textbf {p}}|\theta ) \big \}$. Consider the upper-bound of the p value obtained by replacing $\mathcal {D}$ by the corresponding $\mathcal {D}_i$ in the integrals, i.e.,

$$\begin{aligned} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta \big ) \le \sum _{i=1}^K \phi _i \hspace{-6mm} \int \limits _{\hspace{6mm} \mathcal {D}_i({\textbf {p}}|\theta )} \hspace{-6mm} \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \, d{\textbf {q}}. \end{aligned}$$

Then, we have

$$\begin{aligned} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta \big ) \le \sum _{i=1}^K\phi _i \sum _{j=0}^{m-1}\frac{2^{-j}}{j!}R_i^{2j}e^{-\frac{1}{2}R_i^2} \end{aligned}$$

(9)

for even m. For odd m, we have

$$\begin{aligned}&p{-}\textrm{value}\big ({\textbf {p}}~|~\theta \big ) \le&\sum _{i=1}^K\phi _i \left( \sum _{j=1}^{m}2^m\sqrt{\frac{2}{\pi }}\frac{j!}{(2j)!}R_i^{2j-1}e^{-\frac{1}{2}R_i^2} \right. \nonumber \\ & \left. + 2^{m-1}\left[ 1-{\textbf {erf}}(R_i/\sqrt{2})) \right] \right) . \end{aligned}$$

(10)

Proof

We first rewrite the condition for a feature to be included in $D_i$, showing that this is the outside area of an ellipsoid characterized by $R_i^2 $ as defined below:

$$\begin{aligned} \begin{aligned}&{r c l m} \phi _i \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \le \mathbb {P}({\textbf {p}}|\theta ) \Leftrightarrow \\&\frac{\phi _i}{\sqrt{(2\pi )^d|\Sigma _i|}} \exp -\frac{1}{2}(q-\mu _i)^T\Sigma _i^{-1}(q-\mu _i) \le \mathbb {P}({\textbf {p}}|\theta ) \Leftrightarrow \\&(q-\mu _i)^T\Sigma _i^{-1}(q-\mu _i) \ge R_i^2 \quad \text {where} \\&\max (0, -2\log \mathbb {P}({\textbf {p}}|\theta ) + 2\log \phi _i -\log |\Sigma _i| -d\log 2\pi ) = R_i^2. \end{aligned} \end{aligned}$$

(11)

We then introduce two changes of variables to facilitate the integral. The first one is a normalization of the Gaussian, $u=\Sigma _i^{-1/2}(p-\mu _i)$ with the determinant of the Jacobian $|J|=\sqrt{|\Sigma _i|}$, so that

$$\begin{aligned} \begin{array}{r c l} \hspace{-6mm} \displaystyle \int \limits _{\hspace{6mm} \mathcal {D}_i({\textbf {p}}|\theta )} \hspace{-6mm} \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \, d{\textbf {q}}= & (2\pi )^{-d/2} \hspace{-6mm} \displaystyle \int \limits _{\hspace{6mm} u^Tu\ge R_i^2} \hspace{-6mm} e^{-\frac{1}{2}u^Tu}\, du. \end{array} \end{aligned}$$

(12)

Subsequently, we shift to hyper-spherical coordinates through another change of variables:

$$\begin{aligned} \begin{array}{r l} u_1 & = r \cos {\theta _1} \\ u_2 & = r \sin {\theta _1}\cos {\theta _2}\\ \vdots & = \vdots \\ u_{d-2} & = r \sin {\theta _1}\ldots \sin {\theta _{d-2}}\cos {\theta _{d-1}}\\ u_{d-1} & = r \sin {\theta _1}\ldots \sin {\theta _{d-2}}\sin {\theta _{d-1}}. \end{array} \end{aligned}$$

(13)

The determinant of the Jacobian of this change is $|J|=r^{d-1}\prod _{k=1}^{d-2}\sin ^{d-k-1}\theta _k$. Thus, we have the integral

$$\begin{aligned} & (2\pi )^{-d/2} \int \limits _{ u^Tu\ge R_i^2} e^{-\frac{1}{2}u^Tu}\, du = \nonumber \\= & (2\pi )^{-d/2} \int _0^{2\pi }d\theta _d \prod _{k=1}^{d-2}\int _0^\pi \sin ^{d-k-1}\theta _kd\theta _k\nonumber \\ & \int _{R_i}^{+\infty }r^{d-1}\exp -\frac{1}{2}r^2\, dr. \end{aligned}$$

(14)

We first integrate the angles

$$\begin{aligned} \begin{array}{r c l} \prod _{k=1}^{d-2}\displaystyle \int \limits _0^\pi \sin ^{d-k-1}\theta _kd\theta _k= & 2\frac{\pi ^{d/2}}{\Gamma (d/2)}. \end{array} \end{aligned}$$

(15)

Then, we recognize the probability density of $\chi ^2$ in the integral and therefore the whole is the survival function of $\chi ^2$, such that

$$\begin{aligned} \hspace{-6mm} \int \limits _{\hspace{6mm} \mathcal {D}_i({\textbf {p}}|\theta )} \hspace{-6mm} \mathbb {P}({\textbf {q}}~|~\mu _i, \Sigma _i) \, d{\textbf {q}}= & \displaystyle \int _{R_i}^{+\infty } \frac{1}{2^{d/2 -1}\Gamma (d/2)}r^{d-1}e^{-\frac{1}{2}r^2}\, dr \nonumber \\= & \displaystyle \int _{R_i^2}^{+\infty } \frac{1}{2^{d/2}\Gamma (d/2)}x^{d/2 - 1}e^{-\frac{1}{2}x}\, dx \nonumber \\= & {\textbf {SF}}_{\chi ^2} (R_i^2). \end{aligned}$$

(16)

The r integral can be computed via integration by parts, and the result depends on whether the dimension d of the features is even or odd. We will first see the even case $d=2m$:

$$\begin{aligned} & \int _{R_i}^{+\infty } r^{2m-1}e^{-\frac{1}{2}r^2}\, dr = R_i^{2m-2}e^{-\frac{1}{2}R_i^2}\nonumber \\ & \quad +(2m-2)\int _{R_i}^{+\infty } r^{2m-3}e^{-\frac{1}{2}r^2}\, dr \nonumber \\= & R_i^{2m-2}e^{-\frac{1}{2}R_i^2}+(2m-2)\left[ R_i^{2m-4}e^{-\frac{1}{2}R_i^2}\right. \nonumber \\ & \left. \quad +(2m-4)\int _{R_i}^{+\infty } r^{2m-5}e^{-\frac{1}{2}r^2}\, dr\right] \nonumber \\ & \vdots \nonumber \\= & \sum _{j=1}^{m-1}\frac{(m-1)!}{j!}2^{m-1-j}R_i^{2j}e^{-\frac{1}{2}R_i^2} \nonumber \\ & \quad + 2^{m-1}(m-1)!\int _{R_i}^{+\infty } re^{-\frac{1}{2}r^2} \nonumber \\= & \sum _{j=1}^{m-1}\frac{(m-1)!}{j!}2^{m-1-j}R_i^{2j}e^{-\frac{1}{2}R_i^2} \nonumber \\ & \quad + 2^{m-1}(m-1)!e^{-\frac{1}{2}R_i^2} \nonumber \\= & \sum _{j=0}^{m-1}\frac{(m-1)!}{j!}2^{m-1-j}R_i^{2j}e^{-\frac{1}{2}R_i^2}. \end{aligned}$$

(17)

We proceed in the same way for the odd case $d=2m+1$:

$$\begin{aligned} & \int _{R_i}^{+\infty } r^{2m}e^{-\frac{1}{2}r^2}\, dr = \nonumber \\= & R_i^{2m-1}e^{-\frac{1}{2}R_i^2}+(2m-1)\int _{R_i}^{+\infty } r^{2m-2}e^{-\frac{1}{2}r^2}\, dr \nonumber \\= & R_i^{2m-1}e^{-\frac{1}{2}R_i^2}+(2m-1)\left[ R_i^{2m-3}e^{-\frac{1}{2}R_i^2}\right. \nonumber \\ & \quad \left. +(2m-3)\int _{R_i}^{+\infty } r^{2m-4}e^{-\frac{1}{2}r^2}\, dr\right] \nonumber \\ & \vdots \nonumber \\= & \sum _{j=1}^{m}\frac{(2m)!j!}{m!(2j!)}R_i^{2j-1}e^{-\frac{1}{2}R_i^2} \nonumber \\ & \quad + \frac{(2m)!}{2(m!)}\int _{R_i}^{+\infty }e^{-\frac{1}{2}r^2} \,dr \nonumber \\= & \sum _{j=1}^{m}\frac{(2m)!j!}{m!(2j!)}R_i^{2j-1}e^{-\frac{1}{2}R_i^2} \nonumber \\ & \quad + \frac{(2m)!}{2(m!)}\sqrt{\frac{\pi }{2}}\left[ 1-{\textbf {erf}}(R_i/\sqrt{2})) \right] . \end{aligned}$$

(18)

The advantage of the even case is that it takes the form of a finite sum that is simple to compute, whereas the odd case requires estimating the error function. We get the complete formula (9) of the even case by combining (15), (16), (17), and the complete formula (10) of the odd case by combining (15), (16), (18). $\square $

4.2 A-contrario Validation

We propose an a-contrario method designed to control the number of false alarms on any change detection method. A change candidate will only be meaningful when the expectation of all its independent parts is low. In our case, we define the background model $\mathcal {H}_0$ as the local Gaussian mixture $\theta $ learned during training. Thus, we define the number of false alarms $\textrm{NFA}(R)$ over a region R, as

$$\begin{aligned} \textrm{NFA}(R) = N_T \cdot \mathbbm {P}[Z_{\theta }(R) \ge z(R)], \end{aligned}$$

(19)

where z(R) measures how anomalous the observed values in region R are and $N_T$ is the overall number of potential anomaly regions.

Remark 1

To understand the rationale of this definition of the NFA, we recall that if a value $\epsilon $ is specified and candidates with $\textrm{NFA} < \epsilon $ are accepted as valid detections, and then, it can be shown [51] that $\epsilon $ is an upper-bound to the expected number of false detections under $\mathcal {H}_0$.

The corresponding random variable $Z_{\theta }$ is a random vector with the same dimension as the number of pixels of R. Then, assuming pixel independence (see below), we can measure the anomaly relative to a Gaussian mixture $\theta $ by computing the probability term of (19) as

$$\begin{aligned} \mathbbm {P}[Z_{\theta }(R) \ge z(R)] = \prod _{{\textbf {p}}\in R} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big ), \end{aligned}$$

(20)

where $p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big )$, given by Equation (5), is evaluated on each pixel ${\textbf {p}}$ of the region R.

We need to define the number of tests $N_T$ to complete the NFA formulation (19). $N_T$ is related to the total number of candidate regions that can, in theory, be considered for evaluation. Inspired by the approach in [5], we will consider regions of any shape formed by 4-connected pixels. Regions of pixels with 4-connectivity are known as polyominoes [53, 61]. The exact number $b_n$ of different polyomino configurations of a given size n is not known in general; however, a good estimate [62] is given by $b_n \approx \alpha \frac{\beta ^n}{n}$ where $\alpha \approx 0.317$ and $\beta \approx 4.06$. Additionally, we need to consider that any pixel in the image can be the center of a region and that a region can be of size from 1 to XY, where X and Y are the width and height of the image, respectively. Thus, we can define the number of tests $N_T$ as

$$\begin{aligned} N_T = X^2Y^2 \alpha \frac{\beta ^n}{n}, \end{aligned}$$

(21)

where $n=|R|$ is the size of the region R. Notice that this is not exact, as Equation (21) allows for some potential polyominoes extending outside of the image boundaries, but it is an approximation of the same magnitude.

The a-contrario theory states that a structure is meaningful when the relation between its parts is too regular to be the outcome of an accidental arrangement of independent parts. However, feature map pixels given by a neural network are not completely independent. A feature vector will not be independent with respect to its neighboring vectors within the receptive field. Hence, close-by pixels will not guarantee the independence criteria. A solution to this would be to perform the a-contrario validation by only considering pixels far enough to fulfill the independence criteria. One can visualize this as selecting grids of points that are independent inside the region. Naturally, there are several possible grids assuring independence for a given region. Indeed, if $c_f$ is the size of the receptive field, there are $c_f$ such grids and all of them should be considered. However, a more practical solution is to consider all the pixels of the region and then take the $c_f$-root; this computes the geometric mean of the probability term of all those grids. This is a conservative estimation, as there is certainly one of those grids with a probability term better or equal than the mean.

All in all, Equation (19) can be expressed as

$$\begin{aligned} \textrm{NFA}(R) = X^2Y^2 \alpha \frac{\beta ^n}{n} \left( \prod _{{\textbf {p}}\in R} p{-}\textrm{value}\big ({\textbf {p}}~|~\theta (x,y)\big )\right) ^{\frac{1}{c_f}}\hspace{-5mm}, \end{aligned}$$

(22)

where $c_f=35$ for stage 1 and $c_f=91$ for stage 2, according to their receptive fields.

A region R with $\textrm{NFA}(R) < \epsilon $ is declared a change detection. Since the $\textrm{NFA}$ can take very extreme values, $\log (\textrm{NFA})$ is often easier to handle.

Concerning our way to handle non-independence, we chose the simplest way, which is to subsample the image. Yet, more sophisticated methods have also been proposed such as a first-order Markov chains in the a-contrario model (which takes into account the dependencies between the neighboring pixels, see [63]). In this paper, the Markov model is used for line segment detection, it uses a Markov chain model, which enables computing detection probabilities by dynamic programming, given that the pixels are aligned on each tested segment. We cannot do that in a region, this would require modeling 2D Markov random field (MRF) models as a-contrario models. More precisely, we refer to the core paper Beyond Independence: An extension of the a-contrario decision procedure by Myaskouvskey et al. [64], which does not handle 2D Markov random fields either. In short, the question of how to compute probabilities of false alarms in a 2D Markov random fields seems to be still open. Furthermore, in our case of application, an a-contrario Markov random field model would have to be learned from the images of descriptors observed in a detection network when applied to an input image which is white noise. The learned MRF would then be different for each detection method we want to apply our method too. This Markov random field would be different for each detection network, as it should take into account its connectivity to define the underlying clique. Finally, given this MRF and its underlying structure we would need a method to evaluate its parameters. We could assume for example that it is Gaussian, but we would have no special argument in favor of this simplifying assumption and this would again depend on the underlying detection method. In short, this endeavor would go much beyond the purpose of our paper. It would lead us to the question of modeling stochastic a-contrario random field models for all convolutional networks, which we cannot do in the limited context of this paper. This is why we decided to rely on an empirical independence assumption only to evaluate the p values.

5 Experiments

Given a set of input frames of a sequence, we obtain its feature representations at stages 1 and 2 of a pretrained ResNet-50 [65] architecture. We use a backbone pretrained on ImageNet [66] with self-supervision using the VicReg method [67].

Implementation details. We trained a GMM on each sequence setting $K=1000$ initial Gaussian distributions and we let GLAD remove unnecessary Gaussians, as proposed by its authors [50]. For training, we selected the first few hundred subsequent frames of each scene without any obvious anomalies. For the sequences where anomalies are continuously present (e.g., cars driving by a road), a temporal median filter was applied to filter them out. For the scenes where the first frames only contain a few outliers, we kept them in the training set. The a-contrario validation was applied with a threshold $\epsilon =1$, and we computed the geometric mean of the NFA scores of each stage. For simplicity, the sequences and their ground truths were resized to 256$\times $256.

Results on the CDNet benchmark. CDNet [1, 15] is a benchmark for video change detection consisting of 53 videos divided into 11 categories. Each category corresponds to a specific challenge such as shadow or lowFramerate. It provides pixel-wise annotations for all frames except the first few hundred, which are used for initialization. We apply the proposed approach on all categories of the dataset except for categories cameraJitter and PTZ, as our method assumes that the camera is static and these categories contain variations of the point of view. We evaluated the impact of our approach on a range of methods for the pixel and object-wise metrics introduced in Sect. 3. The selected methods were ViBe [8], SuBSENSE [9], SemanticBGS [13], BSUV-Net [25], BSUVFPM-Net [25] and BSUV-Net 2.0 [26]. The classic ViBe and SuBSENSE are still popular and remain on the state of the art. SemanticBGS combines classical methods with neural networks, and the BSUV family are supervised, scene agnostic methods (though extracting appropriate features from BSUV is difficult due to their reliance on skip connections). Table 1 shows the relative change percentage when applying the proposed a-contrario validation compared to the methods alone on the CDNet dataset (except cameraJitter and PTZ). Furthermore, we selected two categories that are substantially prone to false alarms: dynamic background and bad weather, and show their results separately on Table 2 and Table 3. Notice that we provide the results for each algorithm after applying a 5 by 5 median filter to remove noise and very small false positives, as done by the CDNet authors. For tables containing the exact metric scores, we refer to the Appendix. Formally, we denote by $S_b$ and $S_{a}$ the metric scores before and after the a-contrario validation, respectively. We then compute the percentage improvement as

$$\begin{aligned} {\text {relative change percentage} } = \frac{S_{a} - S_b}{S_b}\times 100. \end{aligned}$$

(23)

As observed, our approach is able to largely improve the object-wise metrics in all methods, while maintaining or improving their pixel-wise counterparts. Notice that in some cases, such as for ViBe, the percentage improvement is surprisingly large. This occurs as a consequence of removing all FPs that correspond to isolated pixels, as the method commonly yields significant segmentation noise in complex scenes. Figure 3 shows examples of the achieved qualitative results for different sequences and methods. Lastly, the proposed method proved to be robust to anomalies in the training data, yielding weak clusters and low probabilities for such occurrences, as per the cases where a few anomalies were present in the training set.

Table 1 Results on the CDNet benchmark including all evaluated categories, expressed as relative change percentage as described in Eq. 23

Statistical Modeling of Deep Features to Reduce False Alarms in Video Change Detection

Abstract

Similar content being viewed by others

Real-Time Permanent Change Proposals for Abandoned Object Detection

Detecting Significant Changes in Image Sequences

A foveated vision framework for visual change detection using motion and textural features

1 Introduction

2 Related Works

3 Pixel and Object-wise Evaluation

4 Our Approach

4.1 Background Feature Modeling

Definition 1

Proposition 1

Proof

4.2 A-contrario Validation

Remark 1

5 Experiments

6 Conclusion

Data Availibility

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Quantitative Results

Quantitative Results

1.1 CDNet Dataset

1.2 LASIESTA Dataset

1.3 Sequences from Zhong and S. Sclaroff

1.4 Examples on Different Image Size

Rights and permissions

About this article

Cite this article

Share this article

Keywords