1 Introduction

Due to the rapid increase in video surveillance data, manually labeling events (for example, road accidents or burglaries) and analyzing footage to detect anomalies is prohibitively costly. Thus, intelligent video surveillance systems that can recognize irregularities in real time have become a central focus of research in computer vision and intelligent systems [1,2,3].

Video anomaly detection is the task of detecting and recognizing unusual actions or occurrences in video. The absence of a clear definition of an anomaly makes this the most difficult aspect of the problem. An anomaly is anything that happens seldom, does not follow normal patterns, or differs greatly from the majority of items in a scene. In this line of research, it is typically assumed that video frames that appear frequently in the dataset are normal and frames that appear infrequently are anomalies. This framing of anomaly detection is also known as one-class learning or novelty detection. Video anomaly detection (AD) applies to intelligent surveillance systems [4], medical image processing [5], defect detection [6], fault diagnosis [7], etc.

Early AD techniques typically relied on matching hand-crafted features informed by domain knowledge, but the rapid advancement of deep learning brought end-to-end deep neural network frameworks, including approaches based on probability estimation [8], one-class learning [9,10,11], and frame prediction and reconstruction [12, 13]. The frame-reconstruction strategy has attracted the most interest. A reconstruction-based method typically assumes that a network trained only on normal samples cannot effectively reconstruct anomalous data; the reconstruction error is therefore used to detect abnormality in anomaly samples. Hasan et al. [14] utilize an auto-encoder (AE)-based reconstruction network and compute the \(L_2\) distance of the reconstructed image to identify anomalies. Zhao et al. [15] proposed a 3D convolution-based two-branch model incorporating prediction and reconstruction to enhance the learning of sample motion information.

Some reconstruction methods make use of adversarial learning; for example, an AE-based adversarial learning framework can detect anomalies by fusing motion and appearance information in a two-stream structure. Liu et al. [13], on the other hand, evaluate whether a video frame is anomalous or normal. A specialized AE [16] is developed to detect anomalies by learning the training data distribution in both the latent vector space and the image space. The latent space, obtained from the encoder in the AE framework, plays a crucial role in improving the reconstructed image [2, 17]. Li et al. [18] proposed ST-CaAE, a two-stream approach for AD that combines an adversarial AE and a convolutional AE based on a spatial-temporal structure. Although the reconstruction-based approach is criticized for giving unseen samples more uncertainty, it can deliver multi-scale representations for video AD with a higher spatial resolution. Moreover, because the reconstruction network's learning is independent of prior information and class labels, it is well suited to practical applications. The proposed model is therefore based on a reconstruction approach.

To enhance the effectiveness of AD, we propose an end-to-end architecture called VALD-GAN that integrates the reconstruction method [9, 19,20,21] with a GAN-based strategy. The main drawback of GAN-based video AD is that the reconstructed frame alone is not sufficient to detect the anomaly, so both the adversarial loss and the reconstruction loss are utilized to enhance the reconstructed frame. The main challenge of GAN training is achieving equilibrium between the generator and the discriminator. The generator network initially receives only normal samples and is trained to produce the corresponding reconstructed video frame, as shown in Fig. 1. We propose a novel latent discriminator that makes the latent space of the generator follow a pre-defined distribution in order to improve the reconstructed video frame. The discriminator network then recognizes anomalous frames by comparing the variations between the reconstructed frame and the input frame for normal and anomaly samples. Our main contributions, in brief, are as follows:

Fig. 1 Distinguishing between the normal and anomaly video frames by the proposed reconstruction-based architecture

1. We propose an adversarially trained denoising GAN autoencoder that can be applied to real-world video surveillance.

2. We present a novel latent discriminator that effectively constrains the latent distribution, playing a vital role in improving the distinguishability of anomalous video frames.

3. We combine the Jeffrey divergence with the discriminative capability of the GAN to develop a new anomaly score that captures the unique characteristics of anomalies, resulting in better AD.

4. Our proposed model VALD-GAN learns end-to-end and, in extensive experiments on benchmark datasets, surpasses existing state-of-the-art approaches.

The rest of the paper is structured as follows. Section 2 reviews related work on AD and GAN-based AD models. Section 3 covers preliminaries. Section 4 presents our proposed VALD-GAN framework. In Sect. 5, we conduct a series of experiments to optimize and validate the effectiveness of our AD technique. Section 6 discusses the results and insights derived from our proposed model. Finally, Sect. 7 concludes the paper by summarizing the main findings and highlighting potential avenues for future research.

2 Related work

2.1 Anomaly detection

Anomaly detection has recently become an active research area. In reconstruction-based approaches, the model is trained to learn features of normal data and reconstruct the input video frame from them. Ribeiro et al. [19] introduced an image-to-image reconstruction approach using an encoder–decoder network; the method recovers input frames and detects anomalies by leveraging the reconstruction error, since normal samples exhibit a small reconstruction error while abnormal samples show a significantly higher error. Xu et al. [20] combine convolutional LSTM and autoencoder networks to reconstruct and predict video frame sequences, which improves the network's capacity to capture motion patterns but also raises the bar for network training. Cong et al. [22] utilized sparse representation learning, exploiting self-representation to find irregular occurrences in videos. Chong et al. [23] propose incorporating a convolutional autoencoder (CAE) into a two-stream network architecture to address training convergence issues. The addition of \(L_2\) weight regularization and bias terms improves the reliability of the AD method, as suggested by Chalapathy et al. [24].

Gordon et al. [25] proposed comparing stored anomalies with video frame reconstructions from an AE. Lu et al. [26] utilized sparse representation-based learning for AE. Chong et al. [23] introduced a spatial-temporal autoencoder (ST-Autoencoder) for reconstruction-based AD. Yan et al. [27] adopted a two-stream recurrent variational AE, while Wang et al. [28] proposed an LSTM-based encoder–decoder architecture for AD in videos. Nawaratne et al. [29] introduced an incremental spatio-temporal learner (ISTL) with active learning techniques for recognizing abnormal occurrences. However, reconstruction networks may mistakenly identify new normal samples as anomalies, necessitating additional restrictions on how they generalize from regular data.

2.2 Generative adversarial networks

The GAN has emerged as a significant breakthrough in deep learning [30]. In a GAN, the discriminator and generator engage in a game: the generator aims to generate realistic data, while the discriminator strives to distinguish between real and generated data. This unsupervised adversarial process continues until a stable, balanced state is reached [31]. Lee et al. [32] employed a dual-directional generator with LSTM to capture spatial and temporal properties of normal patterns, using a 3D-CNN as a discriminator for video AD. Some generators utilize encoder–decoder networks for image-to-image reconstruction [9, 33], using the discriminator's output as the metric for anomaly assessment. Ada-Net [34] merges an auto-encoder network with a GAN-like model and adds a decoder with an attention model that dynamically selects significant portions of the encoded features for decoding, which helps preserve the information needed to learn inherent normal patterns. More advanced GAN-based models continue to be developed for identifying anomalies. A cross-channel network structure with a dual-channel GAN is developed in [35], in which optical flow is generated from frames and frames are generated from optical flow via the cross-channel GAN. Pour et al. [36] provide a novel AD approach that generates aberrant samples using a partially converged WGAN and trains binary classifiers to distinguish between anomalous and normal frames. Yu et al. [37] utilize the Wasserstein loss for a GAN that finds anomalies using future and past discriminators, under the supposition that anything the training cannot adequately model is an abnormality. These works motivate a constrained GAN with greater distinguishability to improve discriminative capacity.

3 Preliminaries

The encoder–decoder-based GAN is composed of two networks: a generator (G) and a discriminator (D). The D network assesses whether a sample originates from the real distribution (\(p_{\text {normal}}\)) or the generated distribution (\(p_t\)), while the G network serves the goal of reconstructing (recreating) the input sample. G is composed of an encoder (\(G_1\)) and a decoder (\(G_2\)). As a consequence of adversarial training, G ensures that the produced image frames appear to come from the true distribution in order to convince D. Since G has never seen the anomalous distribution, the reconstructed anomalous frame will be distorted, which is what we exploit to detect anomalous events.

The Wasserstein loss function [38] is required since the standard GAN objective confines the discriminator output to the range \([0,1]\); the resulting objective is shown in Eq. 1.

$$\min _{G} \max _{D} \left( \mathbb {E}_{M \sim p_{\text {normal}}}\left[ D(M)\right] - \mathbb {E}_{M \sim p_{\text {normal}}}\left[ D(G(M))\right] \right) $$
(1)

where the training of G and D is summarized as a min-max game.

Contemporary AD methods utilize the \(L_2\) loss of G's reconstruction as the anomaly score to identify structural differences, relying on the reconstruction degrading when faced with unseen events. Existing methods therefore apply a threshold to the reconstruction score to flag the anomaly, as shown below:

$$\text {Label}(M)= {\left\{ \begin{array}{ll} \text {Normal}, &{} \text {if } \mathcal {S}(M) \le \tau \\ \text {Anomaly}, &{} \text {if } \mathcal {S}(M) > \tau \end{array}\right. } $$
(2)

where the label of a frame \(M\) is determined from the anomaly score \(\mathcal {S}(M)\) given the threshold \(\tau \).
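As a minimal illustration of Eq. 2 (the score values and \(\tau \) below are hypothetical):

```python
import numpy as np

def label_frames(scores, tau):
    """Eq. 2: a frame is Normal if S(M) <= tau, otherwise Anomaly."""
    return np.where(np.asarray(scores) <= tau, "Normal", "Anomaly")

print(label_frames([0.02, 0.41, 0.05, 0.77], tau=0.3))
# -> ['Normal' 'Anomaly' 'Normal' 'Anomaly']
```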

4 Proposed end-to-end pipeline model: VALD-GAN

Fig. 2 Pipeline of VALD-GAN to detect and localize anomalies in video

The overall architecture of VALD-GAN is shown in Fig. 1. VALD-GAN takes a Gaussian noise-augmented video frame as input and learns to reconstruct a denoised frame that matches the actual data distribution. The generator (G) can reconstruct normal frames well but faces difficulty reconstructing abnormal frames. The complete pipeline of VALD-GAN is shown in Fig. 2. The discriminator (D) calculates the anomaly score of the frames, while G is responsible for frame reconstruction. The anomaly score (AS) obtained from D, along with the proposed distance metric, is utilized to detect and localize the anomaly.

We describe G and D based on the end-to-end pipeline model. G is trained with the objective of increasing the possibility that D will make a mistake, in a framework similar to a two-player minimax game. G consists of the encoder–decoder architecture \(G_1\) and \(G_2\); we emphasize utilizing only the normal video frames for training and using the reconstruction error to determine which part of the frame is anomalous. The encoder \(G_1\) maps input samples to a latent space: \(G_{1}: M \rightarrow \beta \), and the decoder \(G_2\) reconstructs the input video frame from the latent features: \(G_{2}: \beta \rightarrow {\overline{M}} \). The output of \(G_1\) is fed to the latent discriminator, and the generated result \({\overline{M}}\) (obtained from \(G_2\)) is provided directly to D. The latent discriminator (\(D_L\)) constrains the latent distribution \(\beta \) to follow the Gaussian distribution.

Fig. 3 Architecture of G consisting of 2D-CNN, FC, and 2D-deconvolutional layers. The order of dimensions for 2D-CNN is kernel width \(\times \) kernel height \(\times \) input channels \(\times \) output channels; for the FC-layer it is input \(\times \) output

Fig. 4 Architecture of D consisting of 2D-CNN and FC layers. The order of dimensions for 2D-CNN is kernel width \(\times \) kernel height \(\times \) input channels \(\times \) output channels; for the FC-layer it is input \(\times \) output

4.1 Network architecture

The network architectures of G and D consist of 2D-CNNs and fully connected layers (FC-layers), as shown in Figs. 3 and 4, respectively. Given input samples, the 2D-CNNs capture the appearance information, and the FC-layer abstracts the learned representation derived from the 2D-CNNs. The network design of \(D_L\) consists of four FC-layers with 1024, 1024, 1024, and 512 units, respectively. Rather than feeding samples from the \(p_\mathrm{{normal}}\) distribution directly, Gaussian noise is added as indicated in Eq. 3.

$${\tilde{M}} = \left( M \sim p_{\text {normal}}\right) + \left( \eta \sim {\mathcal {N}}\left( 0, \sigma ^{2} {\textbf{I}}\right) \right) \longrightarrow M^{\prime } \sim p_{\text {normal}} $$
(3)

where \(\eta \) is sampled from the zero-mean Gaussian distribution \({\mathcal {N}}\left( 0, \sigma ^{2} {\textbf{I}}\right) \). The outputs of D and the latent discriminator \(D_L\) lie in the range [0,1], as defined in Eqs. 4 and 5, respectively.

$${\mathcal {D}}\left( {\overline{M}}\right) \in [0,1] $$
(4)
$$\mathcal {D_L}\left( {G_{1}}({\tilde{M}})\right) = \mathcal {D_L}\left( \beta \right) \in [0,1] $$
(5)
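As a concrete sketch of \(D_L\) in Keras (the four FC-layer sizes follow Sect. 4.1; the ReLU activations, the latent input dimensionality, and the final sigmoid unit that bounds the output in [0,1] per Eq. 5 are our assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_latent_discriminator(latent_dim=512):
    """Sketch of D_L: four FC layers (1024, 1024, 1024, 512 units)
    followed by a sigmoid unit so the output lies in [0, 1] (Eq. 5)."""
    beta = keras.Input(shape=(latent_dim,))
    x = beta
    for units in (1024, 1024, 1024, 512):
        x = layers.Dense(units, activation="relu")(x)
    score = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(beta, score, name="D_L")
```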

G acts as a transformer, converting \({\tilde{M}}\) into the \(p_{t}\) distribution as shown in Eq. 6.

$${G}\left( {\tilde{M}}\right) = G_{2}\left( G_{1}\left( {\tilde{M}}\right) \right) = \overline{{M}} $$
(6)
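A minimal Keras sketch of this encoder–decoder composition follows; the layer counts, filter sizes, and activations are illustrative assumptions, since Fig. 3 and Sect. 5.2 only fix the 2D-CNN \(\rightarrow \) FC \(\rightarrow \) 2D-deconvolution structure and the \(160 \times 160\) frame size:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_generator(frame_shape=(160, 160, 1), latent_dim=512):
    """Sketch of G(M~) = G2(G1(M~)) = M_bar (Eq. 6)."""
    # Encoder G1: noisy frame -> latent vector beta
    frame = keras.Input(shape=frame_shape)
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(frame)
    x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)
    beta = layers.Dense(latent_dim)(layers.Flatten()(x))
    g1 = keras.Model(frame, beta, name="G1")

    # Decoder G2: beta -> reconstructed frame M_bar
    z = keras.Input(shape=(latent_dim,))
    y = layers.Dense(40 * 40 * 128, activation="relu")(z)
    y = layers.Reshape((40, 40, 128))(y)
    y = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(y)
    m_bar = layers.Conv2DTranspose(frame_shape[-1], 4, strides=2,
                                   padding="same", activation="sigmoid")(y)
    g2 = keras.Model(z, m_bar, name="G2")

    return keras.Model(frame, g2(g1(frame)), name="G")
```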

We utilize the latent discriminator loss to encourage the latent space to follow the Gaussian distribution. The loss function for G is shown in Eq. 7.

$$\begin{aligned} {\mathcal {L}}_{{G}} =&\ \mathbb {E}_{{\tilde{M}} \sim p_{\text {normal}}}\left[ {D}\left( {G}\left( {\tilde{M}}\right) \right) \right] + \mathbb {E}_{M \sim p_{\text {normal}}}\left[ 1-{D}\left( {G}\left( {M}\right) \right) \right] \\&+ \mathbb {E}_{\beta \sim p_{\text {latent}}}\left[ \log \left( 1-{\mathcal {D}}_{L}(\beta )\right) \right] \end{aligned}$$
(7)

The goal of D is to distinguish between the real and reconstructed distributions. An adversarial loss for the \(G+D\) network is used to train the model. The reconstruction loss shown in Eq. 8 is added to the overall loss function to push the generated data distribution toward the normal class distribution. We utilize both the \(L_2\) and \(L_1\) losses: the former favors smoothness and continuous outputs, while the latter offers robustness against outliers and preserves sharp details. We experimented with various \(\lambda \) values in the range [0,1].

$${\mathcal {L}}_{{r}}=\lambda _{1}\left\| G({\tilde{M}})-M\right\| _{2}^{2} + \lambda _{2}\left\| G({\tilde{M}})-M\right\| _{1} $$
(8)
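A sketch of Eq. 8 in TensorFlow (summing over pixels rather than averaging is our assumption; the \(\lambda \) values are those found experimentally in Sect. 6.3):

```python
import tensorflow as tf

def reconstruction_loss(m, m_rec, lam1=0.4, lam2=0.8):
    """Eq. 8: weighted squared-L2 (smoothness) plus L1 (sharp details,
    outlier robustness) distance between M and its reconstruction."""
    diff = m_rec - m
    return lam1 * tf.reduce_sum(tf.square(diff)) + lam2 * tf.reduce_sum(tf.abs(diff))
```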

Consequently, the total loss function for the generator, combining its adversarial training loss and the reconstruction loss, is shown in Eq. 9.

$${\mathcal {L}}_{G^{*}}={\mathcal {L}}_{G} + {\mathcal {L}}_{r} $$
(9)
$${\mathcal {L}}_{{\mathcal {D}}} = \mathbb {E}_{{M} \sim p_{\text {normal}}}\left[ {\mathcal {D}}\left( {G}\left( {\tilde{M}}\right) \right) \right] -\mathbb {E}_{{M} \sim p_{\text {normal}}}\left[ {\mathcal {D}}\left( {M}\right) \right] $$
(10)

In addition, we introduce a loss for the latent feature discriminator \(D_L\) as follows:

$$\begin{aligned} {\mathcal {L}}_{\mathcal {D_L}} =&\ \mathbb {E}_{z \sim {\mathcal {N}}(0,\sigma _l^{2}{\textbf{I}})}\left[ \log \left( {\mathcal {D}}_L(z)\right) \right] \\&+ \mathbb {E}_{\beta \sim p_{\text {latent}}}\left[ \log \left( 1-{\mathcal {D}}_L(\beta )\right) \right] \end{aligned}$$
(11)

where \(\beta \) denotes the output from \(G_1\). When learning to detect anomalies, \({\mathcal {L}}_{\mathcal {D_L}}\) attempts to encode M to \(\beta \) with a distribution close to \({\mathcal {N}}(0, \sigma _l^{2}{\textbf{I}})\). With the use of \(D_L\), VALD-GAN constrains the latent feature distribution toward the normal distribution [39], which helps derive a precise \(p_{\text {latent}}\).

Our total loss function comprises the adversarial loss of the \(G+D\) network and the reconstruction loss. Given the loss functions formulated above, D, \(D_L\), and G are trained by updating their parameters along the gradients of the associated losses, as shown in Eq. 12.

$$\theta _{{\mathcal {D}}}=\theta _{{\mathcal {D}}}+\gamma \frac{d {\mathcal {L}}_{{\mathcal {D}}}}{d \theta _{{\mathcal {D}}}}, \qquad \theta _{{\mathcal {D}}_{L}}=\theta _{{\mathcal {D}}_{L}}+\gamma \frac{d {\mathcal {L}}_{{\mathcal {D}}_{L}}}{d \theta _{{\mathcal {D}}_{L}}}, \qquad \theta _{{G}}=\theta _{{G}}+\gamma \frac{d {\mathcal {L}}_{G^{*}}}{d \theta _{{G}}} $$
(12)
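A minimal sketch of the parameter updates in Eq. 12; the sign argument is our addition, since in a standard GAN implementation D and \(D_L\) ascend their losses while G descends \({\mathcal {L}}_{G^{*}}\):

```python
import tensorflow as tf

GAMMA = 1e-4  # learning rate reported in Sect. 5.2

def gradient_update(tape, loss, variables, sign=1.0):
    """One update theta <- theta + sign * GAMMA * dL/dtheta (Eq. 12).
    Use sign=+1.0 for the ascent steps of D and D_L, sign=-1.0 for G."""
    grads = tape.gradient(loss, variables)
    for g, v in zip(grads, variables):
        if g is not None:  # skip variables the loss does not depend on
            v.assign_add(sign * GAMMA * g)
```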

To compare the two frames \(\overline{{M}}\) and M as in Eq. 2, we propose a distance metric based on the Jeffrey divergence, a modified KL divergence that satisfies the symmetry condition. The Jeffrey divergence is symmetric, numerically stable, and invariant to noise and input scale [40]. The distance metric is defined in Eq. 13.

$$d\left( {\overline{M}}, M\right) = \sum _{i, j}\left( {\bar{m}}_{i, j} \log _{10} \frac{{\bar{m}}_{i, j}}{x_{i, j}} + m_{i, j} \log _{10} \frac{m_{i, j}}{x_{i, j}}\right) , \qquad x_{i, j} = \frac{{\bar{m}}_{i, j}+m_{i, j}}{2} $$
(13)

The discriminator D, with the aid of G, fulfills the responsibilities of an AD model; as a result, both the G and D networks are utilized during testing. The predicted discriminator score D(G(M)), referred to as AS(M), is combined with the Jeffrey divergence to form the final anomaly condition. Extending Eq. 2, the new thresholding process is described as follows:

$$AD\left( M\right) = {\left\{ \begin{array}{ll} \text {Abnormal}, &{} \text {if } AS(M)>\tau _1 \text { and } d\left( {\overline{M}}, M\right) \ge \tau _2 \\ \text {Normal}, &{} \text {otherwise} \end{array}\right. } $$
(14)

where \(\tau _1\) and \(\tau _2\) are predetermined thresholds.
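A numpy sketch of Eqs. 13 and 14; the small epsilon guarding the logarithms and the assumption that frames are normalized to non-negative intensities are ours:

```python
import numpy as np

def jeffrey_divergence(m_bar, m, eps=1e-8):
    """Eq. 13: symmetric (Jeffrey) divergence between the reconstructed
    frame m_bar and the input frame m, with x the element-wise midpoint."""
    x = (m_bar + m) / 2.0
    return float(np.sum(m_bar * np.log10((m_bar + eps) / (x + eps))
                        + m * np.log10((m + eps) / (x + eps))))

def detect(anomaly_score, m_bar, m, tau1, tau2):
    """Eq. 14: a frame is Abnormal only when both the discriminator
    score AS(M) and the Jeffrey distance exceed their thresholds."""
    if anomaly_score > tau1 and jeffrey_divergence(m_bar, m) >= tau2:
        return "Abnormal"
    return "Normal"
```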

5 Experiments

In this section, the capability of VALD-GAN is evaluated on three benchmark video datasets, namely UCSD Peds [41], CUHK Avenue [26], and the Subway dataset [42]. The ability of our proposed approach to identify abnormalities is assessed using frame-level measurements. Since the abnormality threshold affects how well our model works, it is fairer to assess how discriminative our approach is across several threshold selections rather than at a single threshold. The receiver operating characteristic (ROC) curve is used because the discrimination threshold of a binary classifier can be varied, revealing its diagnostic capability. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), computed as TPR = TP/(TP+FN) and FPR = FP/(FP+TN), where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives, respectively. We follow the quantitative evaluation metrics proposed in previous work [37], which are described below:

1. Equal error rate (EER): the value of FPR at the point on the ROC curve where FPR is approximately equal to 1-TPR, where FPR is the false positive rate and TPR is the true positive rate. A lower EER value reflects better performance, signifying an effective reduction in misclassification. To calculate the EER, we follow the iterative Algorithm 1, which iterates over the thresholds used to plot the ROC curve. For the input to the algorithm, we use the ROC utilities of the Sklearn package, which take the ground-truth label and the predicted score of each video frame and return fpr_list (the list of false positive rates), tpr_list (the list of true positive rates), and a list of threshold values. Algorithm 1 first initializes the minimum difference to 1 and the EER value to 0, then iterates through the thresholds to find the point (\(EER\_Threshold\)) where the difference between FPR and 1-TPR is minimal. \(EER\_Threshold\) is the point on the ROC curve where FPR is approximately equal to 1-TPR, and this FPR value is the EER, which lies between 0 and 1.

2. Area under curve (AUC): a threshold-independent evaluation measure for binary classification tasks. It quantifies a method's ability to distinguish between positive and negative instances by calculating the area under the ROC curve. A high AUC value indicates a strong ability to distinguish between normal and anomalous activities, enhancing the method's reliability and effectiveness for identifying unusual events.

Algorithm 1 Calculating Equal Error Rate
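A compact realization of Algorithm 1; we pass per-frame anomaly scores to sklearn's roc_curve, which returns the fpr_list, tpr_list, and threshold list described above:

```python
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    """Algorithm 1: scan the ROC points for the one where FPR is closest
    to 1 - TPR; the FPR there is the EER (between 0 and 1)."""
    fpr_list, tpr_list, _thresholds = roc_curve(y_true, y_score)
    min_diff, eer = 1.0, 0.0
    for fpr, tpr in zip(fpr_list, tpr_list):
        diff = abs(fpr - (1.0 - tpr))
        if diff < min_diff:
            min_diff, eer = diff, fpr
    return eer
```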

Fig. 5 Comparison of ROC curves of different approaches on the UCSD Peds1 and UCSD Peds2 datasets

5.1 Datasets

In this section, the proposed AD model is tested on the UCSD Peds, CUHK Avenue, and Subway benchmark datasets for real-time video surveillance applications.

5.1.1 UCSD Peds dataset

It comprises two subsets, Peds1 and Peds2, captured by outdoor security cameras. Peds1 has a frame size of \(158 \times 234\), while Peds2 has a frame size of \(240 \times 360\). Peds1 consists of 34 training videos and 36 testing videos with 40 abnormalities in total; Peds2 includes 16 training videos, 12 testing videos, and 12 abnormalities.

Table 1 Frame-level comparison of reconstruction-based approaches on the CUHK Avenue and UCSD Peds datasets. "–" denotes that no results are given. The best performance is shown in bold

5.1.2 CUHK avenue dataset

It consists of 47 anomalies, including tossing, wandering, and running, across 16 training videos and 21 test videos with a frame size of \(360 \times 640\). Owing to shifting camera placements and angles, the people in the dataset notably vary in size and scale.

5.1.3 Subway dataset

It consists of an "Entrance video" and an "Exit video" with a frame size of \(512 \times 384\). The "Entrance video" lasts 1 h 36 min and contains 144,249 frames; the "Exit video" runs 43 min and consists of 64,900 frames.

5.2 Experimental results

The training process uses the following parameters: video frame size of \(160 \times 160\), momentum of 0.9, learning and decay rates of \(10^{-4}\), batch size of 64, and convergence tolerance of \(10^{-6}\). During experimentation, the values of \(\sigma \) and \(\sigma _l\) are varied within the ranges [0, 0.5] and [0, 10], respectively; better results are achieved for \(\sigma =0.34\) and \(\sigma _l=1\). We also test various \(\lambda \) values in the range [0,1] and experimentally find \(\lambda _1\) = 0.4 and \(\lambda _2\) = 0.8. All tests are conducted on a dedicated GPU server running Xubuntu 22.04, equipped with an Intel Xeon Gold 6226R processor clocked at 2.9 GHz, 128 GB of RAM, and an Nvidia TESLA GPU. We implement our VALD-GAN architecture using the Keras framework.

The result analysis of the above datasets is discussed below:

1. UCSD Peds1: For the UCSD Peds1 dataset, our proposed VALD-GAN outperforms the other state-of-the-art reconstruction-based approaches. The ROC curve is shown in Fig. 5, and the AUC and EER are reported in Table 1. VALD-GAN improves on \(AEP_\mathrm{{MTRM}}\) by 0.06 in AUC and 0.07 in EER. VALD-GAN outperforms \(AEP_\mathrm{{MTRM}}\) owing to its use of a 2D-CNN rather than a 3D-CNN architecture and its avoidance of past and future frame sequences, which affect the anomaly score of the frame.

2. UCSD Peds2: For the UCSD Peds2 dataset, our proposed VALD-GAN outperforms the other state-of-the-art reconstruction-based approaches. The ROC curve is shown in Fig. 5, and the AUC and EER are reported in Table 1. VALD-GAN improves on \(AEP_\mathrm{{MTRM}}\) by 0.43 in AUC and 0.51 in EER.

3. CUHK Avenue: For the CUHK Avenue dataset, VALD-GAN outperforms the current state-of-the-art reconstruction-based approaches by a greater margin than on the other datasets. The AUC and EER are reported in Table 1. VALD-GAN improves on \(AEP_\mathrm{{MTRM}}\) by 0.83 in AUC and 1.03 in EER.

4. Subway "Entrance" and "Exit" datasets: Here we computed the false alarm rate and the number of detected anomalies for both the "Entrance" and "Exit" videos and compared VALD-GAN with the current state-of-the-art reconstruction-based approaches. As shown in Table 2, VALD-GAN found a total of 62 and 19 anomalies in the "Entrance" and "Exit" videos, of which 4 and 1, respectively, are false alarms. Moreover, the false alarm count of VALD-GAN is lower than that of the other state-of-the-art approaches.

5.3 Anomaly visualization

Figure 6 visualizes anomaly scores on the UCSD Peds2 and CUHK Avenue datasets. In Fig. 6, M denotes the input video frame, M\(^{\prime }\) the noise-added video frame, G(M\(^{\prime }\)) the video frame reconstructed by G, and D(G(M\(^{\prime }\))) the anomaly score from D; the anomalous area identified in the video frame is highlighted with red pixels. The D value for normal video frames is close to 0, whereas the score for anomalous video frames is close to 1.

Table 2 Quantitative comparison of different OCC methods on the Subway dataset, where "Entrance video" and "Exit video" are represented by EN and EX and "False Alarm" by FA
Fig. 6 Visualization of anomaly scores for normal and anomalous video frames: M denotes the real video frame, while G(M\(^{\prime }\)) denotes the video frame reconstructed from the noise-added frame M\(^{\prime }\). The anomaly score D(G(M\(^{\prime }\))) is computed by the discriminator. (a) and (b) are normal video frames, while (c) and (d) are anomalous video frames. In (c), the anomaly is a person dropping his bag in front of the camera; in (d), the anomaly is the presence of a vehicle in the pedestrian area

Table 3 Comparison of VALD-GAN with various reconstruction-based approaches on the UCSD Peds2 dataset in terms of execution speed (seconds to process each frame)
Table 4 The impact of noise added to the input frame on the AUC values for the UCSD Peds2 and CUHK Avenue datasets

5.4 Time complexity

We conducted a comparison of the execution speed of VALD-GAN with other state-of-the-art approaches during testing on the UCSD Peds2 dataset. Table 3 presents the average duration required to process each frame during testing. We compared VALD-GAN with Unmasking [43], Lu et al. [26], FFP+MC [13], ALOCC [9], AMDN [20], and \(AEP_\mathrm{{MTRM}}\) [37]. Notably, VALD-GAN exhibits faster computational performance compared to \(AEP_\mathrm{{MTRM}}\) [37], despite both approaches utilizing deep architectures and video frames for training. The key distinction is that while \(AEP_\mathrm{{MTRM}}\) employs a 3D-CNN model and detects anomalies based on deviations from future frames, VALD-GAN utilizes an end-to-end training procedure with a 2D-CNN architecture, enabling real-time AD based on deviations from normal video frames.

6 Discussion

In this section, we discuss the effect of noise in improving the generalization of the proposed approach and the impact of constraining the latent space on the AUC score. We also discuss the weights assigned to the hyperparameters \(\lambda _1\) and \(\lambda _2\) and their impact on the AUC score.

6.1 Effect of noise in input

To improve the generalizability and robustness of VALD-GAN, we augment the input frames with a noise component \(\eta \). Table 4 presents an ablation study of the effect of different noises and their intensities on our model. We observe that Gaussian noise \({\mathcal {N}}\left( 0, \sigma ^{2} {\textbf{I}}\right) \) with \(\sigma = 0.34\) performs better than no noise, yielding AUC scores 2.5 points higher on UCSD Peds2 and 0.5 points higher on CUHK Avenue, underscoring its importance for detecting anomalies. The salt-and-pepper noise, however, performs worse than the no-noise scenario.

6.2 Constraining the latent space

Another constraint in VALD-GAN applies to the latent space \(\beta \) of G. Specifically, the latent discriminator tries to distinguish \(\beta \) from samples drawn from \({\mathcal {N}}\left( 0, \sigma _{l}^{2} {\textbf{I}}\right) \). Table 5 shows how varying \(\sigma _l\) impacts the performance of VALD-GAN. We observe that \(\sigma _l = 1\) is a suitable choice, preserving training stability while preventing the mode collapse that occurs if \(\beta \) is too small.

Table 5 The impact of the standard deviation \(\sigma _l\) of the latent distribution constraint on the AUC values for the UCSD Peds2 and CUHK Avenue datasets
Fig. 7 AUC scores for different \(\lambda \) values on the CUHK Avenue dataset: variation of the AUC score over \(\lambda _2\) with \(\lambda _1\) = 0.8, and over \(\lambda _1\) with \(\lambda _2\) \(\sim \) 0.4

6.3 Hyperparameter for anomaly score estimation

The hyperparameters \(\lambda _1\) and \(\lambda _2\) are varied to balance the \(L_2\) and \(L_1\) losses, which are further combined with the Jeffrey divergence to obtain the combined anomaly score of the input video frame. We first fix the weight of the \(L_2\) loss at 0.4 and find the weight of the \(L_1\) loss that achieves the best AUC score on the CUHK Avenue dataset, obtaining 0.8. We then set the weight of the \(L_1\) loss to 0.8 and find the weight of the \(L_2\) loss that achieves the best AUC score; the obtained weight is 0.38, which we round to 0.4 for experimentation. Figure 7 shows the variation of \(\lambda _1\) and \(\lambda _2\) for the \(L_2\) and \(L_1\) losses. The best AUC score, 91.03%, is achieved at \(\lambda _1\) = 0.4 and \(\lambda _2\) = 0.8.

7 Conclusion

In this paper, we propose VALD-GAN, a novel model for video AD. Our approach is an end-to-end reconstruction model based on a GAN architecture that learns features from normal video frames and detects abnormality in anomaly samples. To improve the reconstructed frame, we propose a novel latent discriminator that makes the latent space of the generator follow a pre-defined distribution. Through the Jeffrey divergence distance metric and the discriminative capabilities of the GAN, our model captures the unique characteristics of anomalies, resulting in enhanced anomaly scoring. Extensive experimentation on benchmark datasets shows that VALD-GAN outperforms existing state-of-the-art approaches, highlighting its effectiveness in end-to-end learning for video anomaly detection. Although the experimental results show the role of the latent space in improving the anomaly score, learning the latent space remains a challenge. In future work, we will explore new methods based on spatio-temporal latent features for video AD.