1 Introduction

Due to the rapid increase in video surveillance data, manually labeling events (for example, road accidents or burglaries) and analyzing footage to detect anomalies is prohibitively costly. Thus, intelligent video surveillance systems that can recognize irregularities in real time have become a central focus of research in computer vision and intelligent systems [1,2,3].

Video anomaly detection is the task of detecting and recognizing unusual actions or occurrences in video. The absence of a clear definition of an anomaly makes this the most difficult aspect of the problem. An anomaly is anything that happens seldom, does not follow normal patterns, or differs greatly from the majority of items in a scene. In this line of research, it is typically assumed that video frames that appear frequently in the dataset are normal and frames that appear infrequently are anomalies. This framing of anomaly detection is also known as one-class learning or novelty detection. Video anomaly detection (AD) applies to intelligent surveillance systems [4], medical image processing [5], defect detection [6], fault diagnosis [7], etc.

Early AD techniques typically relied on matching hand-crafted features informed by domain knowledge, but the rapid advancement of deep learning brought end-to-end deep neural network frameworks, including approaches based on probability estimation [8], one-class learning [9,10,11], and frame prediction and reconstruction [12, 13]. The frame-reconstruction strategy has attracted the most interest. A reconstruction-based method typically assumes that a network trained only on normal samples cannot effectively reconstruct anomalous data; the reconstruction error is therefore used to detect abnormality in anomaly samples. Hasan et al. [14] utilize an auto-encoder (AE)-based reconstruction network and compute the \(L_2\) distance of the reconstructed image to identify anomalies. Zhao et al. [15] proposed a 3D convolution-based two-branch model incorporating prediction and reconstruction to enhance the learning of sample motion information.

Some reconstruction methods make use of adversarial learning; for example, an AE-based adversarial learning framework can detect anomalies by fusing motion and appearance information in a two-stream structure. Liu et al. [13], on the other hand, evaluate whether a video frame is anomalous or normal. A specialized AE [16] is developed to detect anomalies by learning the training data distribution in both the latent vector space and the image space. The latent space, obtained from the encoder in the AE framework, plays a crucial role in improving the reconstructed image [2, 17]. Li et al. [18] proposed ST-CaAE, a two-stream approach for AD that combines an adversarial AE and a convolutional AE based on a spatial-temporal structure. Although the reconstruction-based approach is criticized for giving unseen samples more uncertainty, it can deliver multi-scale representations for video AD with a higher spatial resolution. Moreover, because the reconstruction network's learning is independent of prior information and class labels, it is well suited to practical applications. The proposed model is therefore based on a reconstruction approach.

To enhance the effectiveness of AD, we propose an end-to-end architecture called VALD-GAN that integrates the reconstruction method [9, 19,20,21] with a GAN-based strategy. The main drawback of GAN-based video AD is that the reconstructed frame alone is not sufficient to detect the anomaly, so both the adversarial loss and the reconstruction loss are utilized to enhance the reconstructed frame. The main challenge of GAN training is achieving equilibrium between the generator and the discriminator. The generator network initially receives only normal samples and is trained to produce the corresponding reconstructed video frame, as shown in Fig. 1. We propose a novel latent discriminator that makes the latent space of the generator follow a pre-defined distribution in order to improve the reconstructed video frame. The discriminator network then recognizes anomalous frames by comparing the variations between the reconstructed frame and the input frame for normal and anomaly samples. Our main contributions, in brief, are as follows:

Fig. 1 Distinguishing between the normal and anomaly video frames by the proposed reconstruction-based architecture

1. We propose an adversarially trained denoising GAN autoencoder that can be applied to real-world video surveillance.

2. We present a novel latent discriminator that effectively constrains the latent distribution, playing a vital role in improving the distinguishability of anomalous video frames.

3. We combine the Jeffrey divergence with the discriminative capability of the GAN to develop a new anomaly score that captures the unique characteristics of anomalies, resulting in better AD.

4. Our proposed model VALD-GAN learns end-to-end and, in extensive experiments on benchmark datasets, surpasses existing state-of-the-art approaches.

The rest of the paper is structured as follows. Section 2 reviews related work on AD and GAN-based AD models. Section 3 covers preliminaries. Section 4 presents our proposed VALD-GAN framework. In Sect. 5, we conduct a series of experiments to optimize and validate the effectiveness of our AD technique. Section 6 discusses the results and insights derived from our proposed model. Finally, Sect. 7 concludes the paper by summarizing the main findings and highlighting potential avenues for future research.

2 Related work

2.1 Anomaly detection

Anomaly detection has recently become an active research area. In reconstruction-based approaches, the model is trained to learn features of normal data and reconstruct the input video frame from them. Ribeiro et al. [19] introduced an image-to-image reconstruction approach using an encoder–decoder network; the method recovers input frames and detects anomalies by leveraging the reconstruction error, since normal samples exhibit a small reconstruction error while abnormal samples show a significantly higher error. Xu et al. [20] combine convolutional LSTM and autoencoder networks to reconstruct and predict video frame sequences, which improves the network's capacity to capture motion patterns but also raises the bar for network training. Cong et al. [22] utilized sparse representation learning, exploiting self-representation to find irregular occurrences in videos. Chong et al. [23] propose incorporating a convolutional autoencoder (CAE) into a two-stream network architecture to address training convergence issues. The addition of \(L_2\) weight regularization and bias terms improves the reliability of the AD method, as suggested by Chalapathy et al. [24].

Gordon et al. [25] proposed comparing stored anomalies with video frame reconstructions from an AE. Lu et al. [26] utilized sparse representation-based learning for AE. Chong et al. [23] introduced a spatial-temporal autoencoder (ST-Autoencoder) for reconstruction-based AD. Yan et al. [27] adopted a two-stream recurrent variational AE, while Wang et al. [28] proposed an LSTM-based encoder–decoder architecture for AD in videos. Nawaratne et al. [29] introduced an incremental spatio-temporal learner (ISTL) with active learning techniques for recognizing abnormal occurrences. However, reconstruction networks may mistakenly identify new normal samples as anomalies, necessitating additional restrictions on how they generalize from regular data.

2.2 Generative adversarial networks

The GAN has emerged as a significant breakthrough in deep learning [30]. In a GAN, the discriminator and generator engage in a game: the generator aims to generate realistic data, while the discriminator strives to distinguish between real and generated data. This unsupervised adversarial process continues until a stable, balanced state is reached [31]. Lee et al. [32] employed a dual-directional generator with LSTM to capture spatial and temporal properties of normal patterns, using a 3D-CNN as a discriminator for video AD. Some generators utilize encoder–decoder networks for image-to-image reconstruction [9, 33], using the discriminator's output as the metric for anomaly assessment. Ada-Net [34] merges an auto-encoder network with a GAN-like model and adds a decoder with an attention model that dynamically selects significant portions of the encoded features for decoding, which helps preserve the information needed to learn inherent normal patterns. More advanced GAN-based models continue to be developed for identifying anomalies. A cross-channel network structure with a dual-channel GAN is developed in [35], in which optical flow is generated from frames and frames are generated from optical flow via the cross-channel GAN. Pour et al. [36] provide a novel AD approach that generates aberrant samples using a partially converged WGAN and trains binary classifiers to distinguish between anomalous and normal frames. Yu et al. [37] utilize the Wasserstein loss for a GAN that finds anomalies using future and past discriminators, under the supposition that anything the training cannot adequately model is an abnormality. These works motivate a constrained GAN with greater distinguishability to improve discriminative capacity.

3 Preliminaries

The encoder–decoder-based GAN is composed of two networks: a generator (G) and a discriminator (D). The D network assesses whether a sample originates from the real distribution (\(p_{\text {normal}}\)) or the generated distribution (\(p_t\)), while the G network serves the goal of reconstructing (recreating) the input sample. G is composed of an encoder (\(G_1\)) and a decoder (\(G_2\)). As a consequence of adversarial training, G ensures that the produced image frames appear to come from the true distribution in order to convince D. Since G has never seen the anomalous distribution, the reconstructed anomalous frame will be distorted, which is what we exploit to detect anomalous events.

The Wasserstein loss function [38] is required since the standard GAN objective confines the discriminator output to the range \([0,1]\); the resulting objective is shown in Eq. 1.

$$\min _{G} \max _{D} \left( \mathbb {E}_{M \sim p_{\text {normal}}}\left[ D(M)\right] - \mathbb {E}_{M \sim p_{\text {normal}}}\left[ D(G(M))\right] \right) $$
(1)

where the training of G and D is summarized as a min-max game.

Contemporary AD methods utilize the \(L_2\) loss of G's reconstruction as the anomaly score to identify structural differences, relying on the reconstruction degrading when faced with unseen events. Existing methods therefore apply a threshold to the reconstruction score to flag the anomaly, as shown below:

$$\text {Label}(M)= {\left\{ \begin{array}{ll} \text {Normal}, &{} \text {if } \mathcal {S}(M) \le \tau \\ \text {Anomaly}, &{} \text {if } \mathcal {S}(M) > \tau \end{array}\right. } $$
(2)

where the label of a frame \(M\) is determined from the anomaly score \(\mathcal {S}(M)\) given the threshold \(\tau \).
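As a minimal illustration of Eq. 2 (the score values and \(\tau \) below are hypothetical):

```python
import numpy as np

def label_frames(scores, tau):
    """Eq. 2: a frame is Normal if S(M) <= tau, otherwise Anomaly."""
    return np.where(np.asarray(scores) <= tau, "Normal", "Anomaly")

print(label_frames([0.02, 0.41, 0.05, 0.77], tau=0.3))
# -> ['Normal' 'Anomaly' 'Normal' 'Anomaly']
```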

4 Proposed end-to-end pipeline model: VALD-GAN

Fig. 2 Pipeline of VALD-GAN to detect and localize anomalies in video

The overall architecture of VALD-GAN is shown in Fig. 1. VALD-GAN takes a Gaussian noise-augmented video frame as input and learns to reconstruct a denoised frame that matches the actual data distribution. The generator (G) can reconstruct normal frames well but faces difficulty reconstructing abnormal frames. The complete pipeline of VALD-GAN is shown in Fig. 2. The discriminator (D) calculates the anomaly score of the frames, while G is responsible for frame reconstruction. The anomaly score (AS) obtained from D, along with the proposed distance metric, is utilized to detect and localize the anomaly.

We describe G and D based on the end-to-end pipeline model. G is trained with the objective of increasing the possibility that D will make a mistake, in a framework similar to a two-player minimax game. G consists of the encoder–decoder architecture \(G_1\) and \(G_2\); we emphasize utilizing only the normal video frames for training and using the reconstruction error to determine which part of the frame is anomalous. The encoder \(G_1\) maps input samples to a latent space: \(G_{1}: M \rightarrow \beta \), and the decoder \(G_2\) reconstructs the input video frame from the latent features: \(G_{2}: \beta \rightarrow {\overline{M}} \). The output of \(G_1\) is fed to the latent discriminator, and the generated result \({\overline{M}}\) (obtained from \(G_2\)) is provided directly to D. The latent discriminator (\(D_L\)) constrains the latent distribution \(\beta \) to follow the Gaussian distribution.

Fig. 3 Architecture of G consisting of 2D-CNN, FC, and 2D-deconvolutional layers. The order of dimensions for 2D-CNN is kernel width \(\times \) kernel height \(\times \) input channels \(\times \) output channels; for the FC-layer it is input \(\times \) output

Fig. 4 Architecture of D consisting of 2D-CNN and FC layers. The order of dimensions for 2D-CNN is kernel width \(\times \) kernel height \(\times \) input channels \(\times \) output channels; for the FC-layer it is input \(\times \) output

4.1 Network architecture

The network architectures of G and D consist of 2D-CNNs and fully connected layers (FC-layers), as shown in Figs. 3 and 4, respectively. Given input samples, the 2D-CNNs capture the appearance information, and the FC-layer abstracts the learned representation derived from the 2D-CNNs. The network design of \(D_L\) consists of four FC-layers with 1024, 1024, 1024, and 512 units, respectively. Rather than feeding samples from the \(p_\mathrm{{normal}}\) distribution directly, Gaussian noise is added as indicated in Eq. 3.

$${\tilde{M}} = \left( M \sim p_{\text {normal}}\right) + \left( \eta \sim {\mathcal {N}}\left( 0, \sigma ^{2} {\textbf{I}}\right) \right) \longrightarrow M^{\prime } \sim p_{\text {normal}} $$
(3)

where \(\eta \) is sampled from the zero-mean Gaussian distribution \({\mathcal {N}}\left( 0, \sigma ^{2} {\textbf{I}}\right) \). The outputs of D and the latent discriminator \(D_L\) lie in the range [0,1], as defined in Eqs. 4 and 5, respectively.

$${\mathcal {D}}\left( {\overline{M}}\right) \in [0,1] $$
(4)
$$\mathcal {D_L}\left( {G_{1}}({\tilde{M}})\right) = \mathcal {D_L}\left( \beta \right) \in [0,1] $$
(5)
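As a concrete sketch of \(D_L\) in Keras (the four FC-layer sizes follow Sect. 4.1; the ReLU activations, the latent input dimensionality, and the final sigmoid unit that bounds the output in [0,1] per Eq. 5 are our assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_latent_discriminator(latent_dim=512):
    """Sketch of D_L: four FC layers (1024, 1024, 1024, 512 units)
    followed by a sigmoid unit so the output lies in [0, 1] (Eq. 5)."""
    beta = keras.Input(shape=(latent_dim,))
    x = beta
    for units in (1024, 1024, 1024, 512):
        x = layers.Dense(units, activation="relu")(x)
    score = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(beta, score, name="D_L")
```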

G acts as a transformer, converting \({\tilde{M}}\) into the \(p_{t}\) distribution as shown in Eq. 6.

$${G}\left( {\tilde{M}}\right) = G_{2}\left( G_{1}\left( {\tilde{M}}\right) \right) = \overline{{M}} $$
(6)
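A minimal Keras sketch of this encoder–decoder composition follows; the layer counts, filter sizes, and activations are illustrative assumptions, since Fig. 3 and Sect. 5.2 only fix the 2D-CNN \(\rightarrow \) FC \(\rightarrow \) 2D-deconvolution structure and the \(160 \times 160\) frame size:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_generator(frame_shape=(160, 160, 1), latent_dim=512):
    """Sketch of G(M~) = G2(G1(M~)) = M_bar (Eq. 6)."""
    # Encoder G1: noisy frame -> latent vector beta
    frame = keras.Input(shape=frame_shape)
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(frame)
    x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)
    beta = layers.Dense(latent_dim)(layers.Flatten()(x))
    g1 = keras.Model(frame, beta, name="G1")

    # Decoder G2: beta -> reconstructed frame M_bar
    z = keras.Input(shape=(latent_dim,))
    y = layers.Dense(40 * 40 * 128, activation="relu")(z)
    y = layers.Reshape((40, 40, 128))(y)
    y = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(y)
    m_bar = layers.Conv2DTranspose(frame_shape[-1], 4, strides=2,
                                   padding="same", activation="sigmoid")(y)
    g2 = keras.Model(z, m_bar, name="G2")

    return keras.Model(frame, g2(g1(frame)), name="G")
```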

We utilize the latent discriminator loss to encourage the latent space to follow the Gaussian distribution. The loss function for G is shown in Eq. 7.

$$\begin{aligned} {\mathcal {L}}_{{G}} =&\ \mathbb {E}_{{\tilde{M}} \sim p_{\text {normal}}}\left[ {D}\left( {G}\left( {\tilde{M}}\right) \right) \right] + \mathbb {E}_{M \sim p_{\text {normal}}}\left[ 1-{D}\left( {G}\left( {M}\right) \right) \right] \\&+ \mathbb {E}_{\beta \sim p_{\text {latent}}}\left[ \log \left( 1-{\mathcal {D}}_{L}(\beta )\right) \right] \end{aligned}$$
(7)

The goal of D is to distinguish between the real and reconstructed distributions. An adversarial loss for the \(G+D\) network is used to train the model. The reconstruction loss shown in Eq. 8 is added to the overall loss function to push the generated data distribution toward the normal class distribution. We utilize both the \(L_2\) and \(L_1\) losses: the former favors smoothness and continuous outputs, while the latter offers robustness against outliers and preserves sharp details. We experimented with various \(\lambda \) values in the range [0,1].

$${\mathcal {L}}_{{r}}=\lambda _{1}\left\| G({\tilde{M}})-M\right\| _{2}^{2} + \lambda _{2}\left\| G({\tilde{M}})-M\right\| _{1} $$
(8)
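A sketch of Eq. 8 in TensorFlow (summing over pixels rather than averaging is our assumption; the \(\lambda \) values are those found experimentally in Sect. 6.3):

```python
import tensorflow as tf

def reconstruction_loss(m, m_rec, lam1=0.4, lam2=0.8):
    """Eq. 8: weighted squared-L2 (smoothness) plus L1 (sharp details,
    outlier robustness) distance between M and its reconstruction."""
    diff = m_rec - m
    return lam1 * tf.reduce_sum(tf.square(diff)) + lam2 * tf.reduce_sum(tf.abs(diff))
```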

Consequently, the total loss function for the generator, combining its adversarial training loss and the reconstruction loss, is shown in Eq. 9.

$${\mathcal {L}}_{G^{*}}={\mathcal {L}}_{G} + {\mathcal {L}}_{r} $$
(9)
$${\mathcal {L}}_{{\mathcal {D}}} = \mathbb {E}_{{M} \sim p_{\text {normal}}}\left[ {\mathcal {D}}\left( {G}\left( {\tilde{M}}\right) \right) \right] -\mathbb {E}_{{M} \sim p_{\text {normal}}}\left[ {\mathcal {D}}\left( {M}\right) \right] $$
(10)

In addition, we introduce a loss for the latent feature discriminator \(D_L\) as follows:

$$\begin{aligned} {\mathcal {L}}_{\mathcal {D_L}} =&\ \mathbb {E}_{z \sim {\mathcal {N}}(0,\sigma _l^{2}{\textbf{I}})}\left[ \log \left( {\mathcal {D}}_L(z)\right) \right] \\&+ \mathbb {E}_{\beta \sim p_{\text {latent}}}\left[ \log \left( 1-{\mathcal {D}}_L(\beta )\right) \right] \end{aligned}$$
(11)

where \(\beta \) denotes the output from \(G_1\). When learning to detect anomalies, \({\mathcal {L}}_{\mathcal {D_L}}\) attempts to encode M to \(\beta \) with a distribution close to \({\mathcal {N}}(0, \sigma _l^{2}{\textbf{I}})\). With the use of \(D_L\), VALD-GAN constrains the latent feature distribution toward the normal distribution [39], which helps derive a precise \(p_{\text {latent}}\).

Our total loss function comprises the adversarial loss of the \(G+D\) network and the reconstruction loss. Given the loss functions formulated above, D, \(D_L\), and G are trained by updating their parameters along the gradients of the associated losses, as shown in Eq. 12.

$$\theta _{{\mathcal {D}}}=\theta _{{\mathcal {D}}}+\gamma \frac{d {\mathcal {L}}_{{\mathcal {D}}}}{d \theta _{{\mathcal {D}}}}, \qquad \theta _{{\mathcal {D}}_{L}}=\theta _{{\mathcal {D}}_{L}}+\gamma \frac{d {\mathcal {L}}_{{\mathcal {D}}_{L}}}{d \theta _{{\mathcal {D}}_{L}}}, \qquad \theta _{{G}}=\theta _{{G}}+\gamma \frac{d {\mathcal {L}}_{G^{*}}}{d \theta _{{G}}} $$
(12)
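A minimal sketch of the parameter updates in Eq. 12; the sign argument is our addition, since in a standard GAN implementation D and \(D_L\) ascend their losses while G descends \({\mathcal {L}}_{G^{*}}\):

```python
import tensorflow as tf

GAMMA = 1e-4  # learning rate reported in Sect. 5.2

def gradient_update(tape, loss, variables, sign=1.0):
    """One update theta <- theta + sign * GAMMA * dL/dtheta (Eq. 12).
    Use sign=+1.0 for the ascent steps of D and D_L, sign=-1.0 for G."""
    grads = tape.gradient(loss, variables)
    for g, v in zip(grads, variables):
        if g is not None:  # skip variables the loss does not depend on
            v.assign_add(sign * GAMMA * g)
```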

To compare the two frames \(\overline{{M}}\) and M as in Eq. 2, we propose a distance metric based on the Jeffrey divergence, a modified KL divergence that satisfies the symmetry condition. The Jeffrey divergence is symmetric, numerically stable, and invariant to noise and input scale [40]. The distance metric is defined in Eq. 13.

$$d\left( {\overline{M}}, M\right) = \sum _{i, j}\left( {\bar{m}}_{i, j} \log _{10} \frac{{\bar{m}}_{i, j}}{x_{i, j}} + m_{i, j} \log _{10} \frac{m_{i, j}}{x_{i, j}}\right) , \qquad x_{i, j} = \frac{{\bar{m}}_{i, j}+m_{i, j}}{2} $$
(13)

The discriminator D, with the aid of G, fulfills the responsibilities of an AD model; as a result, both the G and D networks are utilized during testing. The predicted discriminator score D(G(M)), referred to as AS(M), is combined with the Jeffrey divergence to form the final anomaly condition. Extending Eq. 2, the new thresholding process is described as follows:

$$AD\left( M\right) = {\left\{ \begin{array}{ll} \text {Abnormal}, &{} \text {if } AS(M)>\tau _1 \text { and } d\left( {\overline{M}}, M\right) \ge \tau _2 \\ \text {Normal}, &{} \text {otherwise} \end{array}\right. } $$
(14)

where \(\tau _1\) and \(\tau _2\) are predetermined thresholds.
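A numpy sketch of Eqs. 13 and 14; the small epsilon guarding the logarithms and the assumption that frames are normalized to non-negative intensities are ours:

```python
import numpy as np

def jeffrey_divergence(m_bar, m, eps=1e-8):
    """Eq. 13: symmetric (Jeffrey) divergence between the reconstructed
    frame m_bar and the input frame m, with x the element-wise midpoint."""
    x = (m_bar + m) / 2.0
    return float(np.sum(m_bar * np.log10((m_bar + eps) / (x + eps))
                        + m * np.log10((m + eps) / (x + eps))))

def detect(anomaly_score, m_bar, m, tau1, tau2):
    """Eq. 14: a frame is Abnormal only when both the discriminator
    score AS(M) and the Jeffrey distance exceed their thresholds."""
    if anomaly_score > tau1 and jeffrey_divergence(m_bar, m) >= tau2:
        return "Abnormal"
    return "Normal"
```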

5 Experiments

In this section, the capability of VALD-GAN is evaluated on three benchmark video datasets, namely UCSD Peds [41], CUHK Avenue [26], and the Subway dataset [42]. The ability of our proposed approach to identify abnormalities is assessed using frame-level measurements. Since the abnormality threshold affects how well our model works, it is fairer to assess how discriminative our approach is across several threshold selections rather than at a single threshold. The receiver operating characteristic (ROC) curve is used because the discrimination threshold of a binary classifier can be varied, revealing its diagnostic capability. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), computed as TPR = TP/(TP+FN) and FPR = FP/(FP+TN), where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives, respectively. We follow the quantitative evaluation metrics proposed in previous work [37], which are described below:

1. Equal error rate (EER): the value of FPR at the point on the ROC curve where FPR is approximately equal to 1-TPR, where FPR is the false positive rate and TPR is the true positive rate. A lower EER value reflects better performance, signifying an effective reduction in misclassification. To calculate the EER, we follow the iterative Algorithm 1, which iterates over the thresholds used to plot the ROC curve. For the input to the algorithm, we use the ROC utilities of the Sklearn package, which take the ground-truth label and the predicted score of each video frame and return fpr_list (the list of false positive rates), tpr_list (the list of true positive rates), and a list of threshold values. Algorithm 1 first initializes the minimum difference to 1 and the EER value to 0, then iterates through the thresholds to find the point (\(EER\_Threshold\)) where the difference between FPR and 1-TPR is minimal. \(EER\_Threshold\) is the point on the ROC curve where FPR is approximately equal to 1-TPR, and this FPR value is the EER, which lies between 0 and 1.

2. Area under curve (AUC): a threshold-independent evaluation measure for binary classification tasks. It quantifies a method's ability to distinguish between positive and negative instances by calculating the area under the ROC curve. A high AUC value indicates a strong ability to distinguish between normal and anomalous activities, enhancing the method's reliability and effectiveness for identifying unusual events.

Algorithm 1 Calculating Equal Error Rate
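A compact realization of Algorithm 1; we pass per-frame anomaly scores to sklearn's roc_curve, which returns the fpr_list, tpr_list, and threshold list described above:

```python
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    """Algorithm 1: scan the ROC points for the one where FPR is closest
    to 1 - TPR; the FPR there is the EER (between 0 and 1)."""
    fpr_list, tpr_list, _thresholds = roc_curve(y_true, y_score)
    min_diff, eer = 1.0, 0.0
    for fpr, tpr in zip(fpr_list, tpr_list):
        diff = abs(fpr - (1.0 - tpr))
        if diff < min_diff:
            min_diff, eer = diff, fpr
    return eer
```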

Fig. 5 Comparison of ROC curves of different approaches on the UCSD Peds1 and UCSD Peds2 datasets

5.1 Datasets

In this section, the proposed AD model is tested on the UCSD Peds, CUHK Avenue, and Subway benchmark datasets for real-time video surveillance applications.

5.1.1 UCSD Peds dataset

It comprises two subsets, Peds1 and Peds2, captured by outdoor security cameras. Peds1 has a frame size of \(158 \times 234\), while Peds2 has a frame size of \(240 \times 360\). Peds1 consists of 34 training videos and 36 testing videos with 40 abnormalities in total; Peds2 includes 16 training videos, 12 testing videos, and 12 abnormalities.

Table 1 Frame-level comparison of reconstruction-based approaches on the CUHK Avenue and UCSD Peds datasets. "–" denotes that no results are given. The best performance is shown in bold

5.1.2 CUHK avenue dataset

It consists of 47 anomalies, including tossing, wandering, and running, across 16 training videos and 21 test videos with a frame size of \(360 \times 640\). Owing to shifting camera placements and angles, the people in the dataset notably vary in size and scale.

5.1.3 Subway dataset

It consists of an "Entrance video" and an "Exit video" with a frame size of \(512 \times 384\). The "Entrance video" lasts 1 h 36 min and contains 144,249 frames; the "Exit video" runs 43 min and consists of 64,900 frames.

5.2 Experimental results

The training process uses the following parameters: video frame size of \(160 \times 160\), momentum of 0.9, learning and decay rates of \(10^{-4}\), batch size of 64, and convergence tolerance of \(10^{-6}\). During experimentation, the values of \(\sigma \) and \(\sigma _l\) are varied within the ranges [0, 0.5] and [0, 10], respectively; better results are achieved for \(\sigma =0.34\) and \(\sigma _l=1\). We also test various \(\lambda \) values in the range [0,1] and experimentally find \(\lambda _1\) = 0.4 and \(\lambda _2\) = 0.8. All tests are conducted on a dedicated GPU server running Xubuntu 22.04, equipped with an Intel Xeon Gold 6226R processor clocked at 2.9 GHz, 128 GB of RAM, and an Nvidia TESLA GPU. We implement our VALD-GAN architecture using the Keras framework.

The result analysis of the above datasets is discussed below:

1. UCSD Peds1: For the UCSD Peds1 dataset, our proposed VALD-GAN outperforms the other state-of-the-art reconstruction-based approaches. The ROC curve is shown in Fig. 5, and the AUC and EER are reported in Table 1. VALD-GAN improves on \(AEP_\mathrm{{MTRM}}\) by 0.06 in AUC and 0.07 in EER. VALD-GAN outperforms \(AEP_\mathrm{{MTRM}}\) owing to its use of a 2D-CNN rather than a 3D-CNN architecture and its avoidance of past and future frame sequences, which affect the anomaly score of the frame.

2. UCSD Peds2: For the UCSD Peds2 dataset, our proposed VALD-GAN outperforms the other state-of-the-art reconstruction-based approaches. The ROC curve is shown in Fig. 5, and the AUC and EER are reported in Table 1. VALD-GAN improves on \(AEP_\mathrm{{MTRM}}\) by 0.43 in AUC and 0.51 in EER.

3. CUHK Avenue: For the CUHK Avenue dataset, VALD-GAN outperforms the current state-of-the-art reconstruction-based approaches by a greater margin than on the other datasets. The AUC and EER are reported in Table 1. VALD-GAN improves on \(AEP_\mathrm{{MTRM}}\) by 0.83 in AUC and 1.03 in EER.

4. Subway "Entrance" and "Exit" datasets: Here we computed the false alarm rate and the number of detected anomalies for both the "Entrance" and "Exit" videos and compared VALD-GAN with the current state-of-the-art reconstruction-based approaches. As shown in Table 2, VALD-GAN found a total of 62 and 19 anomalies in the "Entrance" and "Exit" videos, of which 4 and 1, respectively, are false alarms. Moreover, the false alarm count of VALD-GAN is lower than that of the other state-of-the-art approaches.

5.3 Anomaly visualization

Figure 6 visualizes anomaly scores on the UCSD Peds2 and CUHK Avenue datasets. In Fig. 6, M denotes the input video frame, M\(^{\prime }\) the noise-added video frame, G(M\(^{\prime }\)) the video frame reconstructed by G, and D(G(M\(^{\prime }\))) the anomaly score from D; the anomalous area identified in the video frame is highlighted with red pixels. The D value for normal video frames is close to 0, whereas the score for anomalous video frames is close to 1.

Table 2 Quantitative comparison of different OCC methods on the Subway dataset, where "Entrance video" and "Exit video" are represented by EN and EX and "False Alarm" by FA
Fig. 6 Visualization of anomaly scores for normal and anomalous video frames: M denotes the real video frame, while G(M\(^{\prime }\)) denotes the video frame reconstructed from the noise-added frame M\(^{\prime }\). The anomaly score D(G(M\(^{\prime }\))) is computed by the discriminator. (a) and (b) are normal video frames, while (c) and (d) are anomalous video frames. In (c), the anomaly is a person dropping his bag in front of the camera; in (d), the anomaly is the presence of a vehicle in the pedestrian area

Table 3 Comparison of VALD-GAN with various reconstruction-based approaches on the UCSD Peds2 dataset in terms of execution speed (seconds to process each frame)
Table 4 The impact of noise added to the input frame on the AUC values for the UCSD Peds2 and CUHK Avenue datasets

5.4 Time complexity

We conducted a comparison of the execution speed of VALD-GAN with other state-of-the-art approaches during testing on the UCSD Peds2 dataset. Table 3 presents the average duration required to process each frame during testing. We compared VALD-GAN with Unmasking [43], Lu et al. [26], FFP+MC [13], ALOCC [9], AMDN [20], and \(AEP_\mathrm{{MTRM}}\) [37]. Notably, VALD-GAN exhibits faster computational performance compared to \(AEP_\mathrm{{MTRM}}\) [37], despite both approaches utilizing deep architectures and video frames for training. The key distinction is that while \(AEP_\mathrm{{MTRM}}\) employs a 3D-CNN model and detects anomalies based on deviations from future frames, VALD-GAN utilizes an end-to-end training procedure with a 2D-CNN architecture, enabling real-time AD based on deviations from normal video frames.

6 Discussion

In this section, we discuss the effect of noise in improving the generalization of the proposed approach and the impact of constraining the latent space on the AUC score. We also discuss the weights assigned to the hyperparameters \(\lambda _1\) and \(\lambda _2\) and their impact on the AUC score.

6.1 Effect of noise in input

To improve the generalizability and robustness of VALD-GAN, we augment the input frames with a noise component \(\eta \). Table 4 presents an ablation study of the effect of different noises and their intensities on our model. We observe that Gaussian noise \({\mathcal {N}}\left( 0, \sigma ^{2} {\textbf{I}}\right) \) with \(\sigma = 0.34\) performs better than no noise, yielding AUC scores 2.5 points higher on UCSD Peds2 and 0.5 points higher on CUHK Avenue, underscoring its importance for detecting anomalies. The salt-and-pepper noise, however, performs worse than the no-noise scenario.

6.2 Constraining the latent space

Another constraint in VALD-GAN applies to the latent space \(\beta \) of G. Specifically, the latent discriminator tries to distinguish \(\beta \) from samples drawn from \({\mathcal {N}}\left( 0, \sigma _{l}^{2} {\textbf{I}}\right) \). Table 5 shows how varying \(\sigma _l\) impacts the performance of VALD-GAN. We observe that \(\sigma _l = 1\) is a suitable choice, preserving training stability while preventing the mode collapse that occurs if \(\beta \) is too small.

Table 5 The impact of the standard deviation \(\sigma _l\) of the latent distribution constraint on the AUC values for the UCSD Peds2 and CUHK Avenue datasets
Fig. 7 AUC scores for different \(\lambda \) values on the CUHK Avenue dataset: variation of the AUC score over \(\lambda _2\) with \(\lambda _1\) = 0.8, and over \(\lambda _1\) with \(\lambda _2\) \(\sim \) 0.4

6.3 Hyperparameter for anomaly score estimation

The hyperparameters \(\lambda _1\) and \(\lambda _2\) are varied to balance the \(L_2\) and \(L_1\) losses, which are further combined with the Jeffrey divergence to obtain the combined anomaly score of the input video frame. We first fix the weight of the \(L_2\) loss at 0.4 and find the weight of the \(L_1\) loss that achieves the best AUC score on the CUHK Avenue dataset, obtaining 0.8. We then set the weight of the \(L_1\) loss to 0.8 and find the weight of the \(L_2\) loss that achieves the best AUC score; the obtained weight is 0.38, which we round to 0.4 for experimentation. Figure 7 shows the variation of \(\lambda _1\) and \(\lambda _2\) for the \(L_2\) and \(L_1\) losses. The best AUC score, 91.03%, is achieved at \(\lambda _1\) = 0.4 and \(\lambda _2\) = 0.8.

7 Conclusion

In this paper, we propose VALD-GAN, a novel model for video AD. Our approach is an end-to-end reconstruction model based on a GAN architecture that learns features from normal video frames and detects abnormality in anomaly samples. To improve the reconstructed frame, we propose a novel latent discriminator that makes the latent space of the generator follow a pre-defined distribution. Through the Jeffrey divergence distance metric and the discriminative capabilities of the GAN, our model captures the unique characteristics of anomalies, resulting in enhanced anomaly scoring. Extensive experimentation on benchmark datasets shows that VALD-GAN outperforms existing state-of-the-art approaches, highlighting its effectiveness in end-to-end learning for video anomaly detection. Although the experimental results show the role of the latent space in improving the anomaly score, learning the latent space remains a challenge. In future work, we will explore new methods based on spatio-temporal latent features for video AD.