1 Introduction

Detecting an abnormal event among normal events in a video may be straightforward for an expert human observer; for a computer, however, it is far from easy. Video Anomaly Detection (VAD) remains one of the challenging problems in computer vision. Recently, there has also been a need for a unified video anomaly detection framework that can handle both ground-based and aerial video. The difficulty of building a high-performing VAD system can be summarized in two aspects. First, abnormal events occur only sporadically compared to normal activities, so collecting sufficient abnormal cases is hard. Moreover, since abnormal events in the real world are complicated and diverse, it is difficult to define all possible anomaly patterns. One prominent way to handle this issue is to adopt unsupervised learning with a U-net: only normal events are used during training, and yet the network can distinguish between normal and abnormal events by measuring prediction errors during testing. Second, aerial videos are far more dynamic than ground-based videos because all footage is captured while the drone is flying. Since the network must handle these spatio-temporal aspects to achieve high performance, we propose a hierarchical spatio-temporal transformer embedded within a U-net.

Most anomaly detection methods are based on Convolutional Neural Networks (CNNs) and can be divided roughly into two categories: reconstruction-based [1, 2] and prediction-based approaches [3, 4]. Such networks typically consist of an encoder and a decoder: the former extracts features from input frames and the latter generates the corresponding output. A drawback of CNNs, however, is that their relatively small receptive fields make it difficult to associate two distant points in a given image. To handle this, some approaches utilize attention mechanisms to capture such long-range information within the video frames. For instance, channel attention has been applied to various components of convolutional networks, such as encoders [5], skip connections [6], or decoders [7], while spatial attention is commonly leveraged within the skip connections [5, 6]. On the other hand, [8] inserts a squeeze-and-excitation block as an attention layer between the encoder and decoder. However, the architectures in these studies still rely primarily on convolutions, with attention serving only as an auxiliary module.

Transformers have demonstrated strong performance in various computer vision tasks. In video anomaly detection, TransAnomaly [9] adds two separate transformer modules to a convolutional network to enhance temporal information. However, this approach captures temporal information only from the smallest-scale feature map, since these modules are attached after the last layer of the encoder. On the other hand, ANDT [10] utilizes a transformer as an encoder to extract features from input frames for anomaly detection in aerial videos. However, this transformer generates only a single-scale feature map and thus does not provide the coarse-to-fine feature maps the decoder needs to generate its output.

To address the limitations of such attention modules and columnar transformers, and drawing inspiration from the success of vision transformers, we propose a Hierarchical Spatio-Temporal Transformer for U-net (HSTforU) for anomaly detection in aerial and ground-based videos. Unlike previous transformer-based methods, we extract multi-scale feature maps from input video frames using a four-stage transformer backbone, with output resolutions of \(\frac{H}{4} \times \frac{W}{4} \times C_{1}\), \(\frac{H}{8} \times \frac{W}{8} \times C_{2}\), \(\frac{H}{16} \times \frac{W}{16} \times C_{3}\), and \(\frac{H}{32} \times \frac{W}{32} \times C_{4}\). To capture both appearance and motion information from these multi-scale feature maps, we propose a Hierarchical Spatio-Temporal (HST) transformer. Different from traditional video transformers, the HST transformer has a hierarchical architecture that encodes features at various scales, from low resolution to high resolution. Within the HST transformer, we introduce an efficient attention computation in which both temporal attention and spatial attention are computed in every transformer layer. Note that this spatio-temporal attention is computed in all n layers of each stage of the HST transformer.

Our main contributions are summarized as follows:

  • A transformer-based video anomaly detection framework is presented where one can investigate both aerial and ground-based anomaly datasets in a unified manner.

  • We propose a new HSTforU, which can generate multi-scale feature maps for reconstructing the output frame that requires dense prediction at the pixel level as well as for modeling long-range dependencies of video frames by capturing both spatial and temporal information.

  • Extensive evaluation, including ablation studies on an aerial and three benchmark ground-based anomaly datasets, suggests that our network achieves competitive performance compared to state-of-the-art methods.

The rest of this paper is organized as follows: The related studies about video anomaly detection are discussed in Section 2. The proposed hierarchical spatio-temporal transformer for video anomaly detection is presented in Section 3. In Section 4, experiments including an ablation study are conducted to validate the effectiveness of the proposed network. Finally, Section 5 concludes this study.

2 Related work

In this section, we discuss recent approaches that are closely related to autoencoder-based video anomaly detection and review several vision transformers relevant to our model. Most of these approaches build on deep convolutional neural networks, which have been employed in a wide range of applications. Under unsupervised learning, the dominant paradigm in VAD, there are roughly two representative families: reconstruction-based and prediction-based approaches.

2.1 Reconstruction-based approaches

The reconstruction-based approach typically utilizes an autoencoder to reconstruct input frames. The reconstruction error between input and output determines whether a frame depicts a normal or an abnormal event. Deepak et al. [2] proposed an autoencoder for video frame reconstruction; to exploit both spatial and temporal information from input frames, a ConvLSTM layer was added to the encoder and decoder, and a residual block was placed between the encoder and decoder to avoid the vanishing gradient problem. To address the false reconstruction problem of the autoencoder, [11] combined a human expert’s feedback with the convolutional autoencoder’s output, while [12] combined an autoencoder with an explanation method to enhance its interpretability. On the other hand, [12,13,14] proposed two-stream networks designed to extract both spatial and temporal features to improve the performance of their reconstruction networks. Similarly, [15] employed a two-stream network that has both a forward network for reconstructing the current frame and a backward network for generating the reversed frames.

2.2 Prediction-based approaches

The prediction-based approach predicts future frames from several preceding frames, assuming that abnormal events are unpredictable whereas normal events are predictable. U-Net is typically employed for this purpose. To enhance network performance, [3] incorporated a visual relation module into the U-Net, while [16] added a context module and a ConvGRU module to extract multi-scale features and to model temporal information. [8] embedded an attention layer and a memory addressing module between the encoder and decoder of the prediction network. To exploit both spatial and temporal features, a 3D U-Net was constructed for predicting future frames [4, 17, 18]. The two-stream network of [19] consists of an appearance stream for capturing spatial features and a motion stream for exploiting temporal features. For predicting future frames, [20] proposed a multi-timescale model in which each timescale has both a motion and an appearance stream. Two U-Nets [21, 22] were used to predict forward and backward frames. In contrast to most recent methods, which operate at the video frame level, [23] proposed a model that uses graph convolutional networks to predict skeleton behavior, extracting a set of joints for pose perception across consecutive frames.

2.3 Integration of reconstruction and prediction

To leverage the advantages of both reconstruction and prediction approaches, a reconstruction network has been combined with a prediction network in an end-to-end framework for video anomaly detection [24,25,26]. Chang et al. [27] introduced a two-stream network to reconstruct the first frame of the input as well as to predict the RGB difference from a sequence of consecutive frames.

2.4 Vision transformer

Since Vision Transformer (ViT) [28] achieved state-of-the-art performance in image classification, several transformer-based methods have been introduced to generate multi-scale feature maps for downstream tasks, such as Pyramid Vision Transformers (PVTs) [29, 30] and Swin Transformers [31, 32]. Building upon the success of transformers in vision tasks, TransAnomaly [9] applied a transformer to video anomaly detection by inserting a temporal transformer encoder and a spatial transformer encoder into an autoencoder, thereby enhancing temporal information and global contexts. Similarly, Sun et al. [33] incorporated a transformer module into a convolutional encoder to leverage its powerful modeling capabilities for capturing temporal features and global information. ANDT [10] combined a Vision Transformer [28] with a convolutional decoder to predict frames.

Table 1 Definitions of the symbols and process components used in this paper

3 Method

3.1 Problem formulation

VAD aims to identify whether the events in a video are normal or abnormal. Given that abnormal cases are rare and often difficult to define, as mentioned above, a neural network such as a U-net is trained on normal cases only, in an unsupervised manner. Since the network produces a larger error for abnormal events than for normal ones, the two can then be readily discriminated. A common way to realize this is to predict a future frame from the preceding frames: the predicted frame \(\hat{x}_{t+1}\) is generated from the k consecutive frames \(x_{t-k:t}\) and compared with the ground-truth frame \(x_{t+1}\) to compute the losses during training and the anomaly score during testing. In the present study, we propose a transformer-based U-net for high-performing VAD, applicable to both ground-based and aerial datasets. Before providing a detailed description of our network, we define the symbols for all variables and the process components within the system in Table 1.
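
As a concrete illustration of this formulation, the following minimal sketch builds (input clip, target frame) pairs from a video tensor; the helper name, the tensor layout, and the default value of k are illustrative assumptions rather than details from the paper.

```python
import torch

def make_clips(frames: torch.Tensor, k: int = 4):
    """Build (x_{t-k:t}, x_{t+1}) training pairs from a video tensor of shape
    (T, C, H, W). With k = 4, four consecutive frames predict the fifth."""
    clips, targets = [], []
    for t in range(k, frames.shape[0]):
        clips.append(frames[t - k:t])   # k consecutive input frames
        targets.append(frames[t])       # ground-truth future frame
    return torch.stack(clips), torch.stack(targets)
```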

Fig. 1 The overall architecture of the proposed HSTforU. The encoder utilizes a four-stage pyramid transformer \(\mathcal {E}_{1:4}\) to generate multi-scale feature maps, and each stage has a link to an HST transformer. The HST transformer also has four stages \(\mathcal {H}_{1:4}\) to handle the corresponding feature maps, and each stage computes spatio-temporal features jointly. The multi-scale feature maps extracted from this transformer are conveyed to a convolutional decoder

3.2 Overall architecture

A schematic diagram of our network is illustrated in Fig. 1. First, a sequence of input frames \(x_{t-k:t}\) is fed into a multi-stage transformer encoder \(\mathcal {E}\) to obtain multi-scale feature maps \(y = \mathcal {E}(x_{t-k:t})\). To better capture both spatial and temporal information, the extracted feature maps y are fed into an HST transformer \(\mathcal {H}\) to obtain \(z = {\mathcal {H}(y)}\). Note that the output \(y_{i}\) of each stage \(\mathcal {E}_{i}\) of the encoder \(\mathcal {E}\) serves as input for both the next stage \(\mathcal {E}_{i+1}\) and the corresponding stage \(\mathcal {H}_{j}\) of the HST transformer. In addition, a residual connection is applied after each stage \(\mathcal {H}_{j}\) of the HST transformer. Finally, the four-stage hierarchical representations z are fed to the decoder \(\mathcal {D}\) to generate the future frame \(\hat{x}_{t+1} = \mathcal {D}(z)\). The decoder is an upstream network, where the input to each layer consists of the upsampled feature map from the previous layer and the corresponding output \(z_{i}\) of the HST transformer.
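
This data flow can be summarized in the following schematic sketch; `enc_stages`, `hst_stages`, and `decoder` are placeholders for the modules described in Sections 3.3-3.5, and the exact placement of the residual connection and the tensor layouts are simplifying assumptions.

```python
import torch.nn as nn

class HSTforUSketch(nn.Module):
    """Schematic forward pass only: each encoder stage feeds both the next
    encoder stage and its corresponding HST stage; the decoder consumes the
    four-stage representations z to predict the future frame."""
    def __init__(self, enc_stages, hst_stages, decoder):
        super().__init__()
        self.enc_stages = nn.ModuleList(enc_stages)   # E_1 .. E_4
        self.hst_stages = nn.ModuleList(hst_stages)   # H_1 .. H_4
        self.decoder = decoder                        # upsampling decoder D

    def forward(self, x):                             # x: input clip x_{t-k:t}
        y, z = x, []
        for enc, hst in zip(self.enc_stages, self.hst_stages):
            y = enc(y)                 # multi-scale feature map y_i
            z.append(hst(y) + y)       # spatio-temporal features + residual
        return self.decoder(z)         # predicted frame x_hat_{t+1}
```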

3.3 Encoder

The encoder contains four stages, and each stage consists of a patch embedding and several transformer layers, as shown in Fig. 2 (top). These stages generate coarse-to-fine feature maps. A convolution with stride S is applied to an input frame of size \(H \times W \times 3\) to obtain overlapping embedded patches of size \(\frac{H}{S} \times \frac{W}{S} \times C_{1}\). The embedded patches are then passed through several transformer layers to obtain a feature map \(\textbf{x}_{1}\) of size \(\frac{H}{4} \times \frac{W}{4} \times C_{1}\). Similarly, the feature maps \(\textbf{x}_{2}\), \(\textbf{x}_{3}\), and \(\textbf{x}_{4}\) of stages two, three, and four are generated by feeding the output of each stage to the next, as illustrated in Fig. 1.
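
A minimal sketch of this overlapping patch embedding is given below; the kernel size and padding (chosen so that the output resolution is H/S × W/S) are assumptions in the spirit of PVT-style embeddings, not values taken from the paper.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Embed overlapping patches with a strided convolution.
    Kernel size 2*S-1 and padding S-1 are assumptions chosen so that the
    output resolution is H/S x W/S, matching the text."""
    def __init__(self, in_ch=3, embed_dim=64, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=2 * stride - 1,
                              stride=stride,
                              padding=stride - 1)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, C1, H/S, W/S)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)        # (B, H*W/S^2, C1) tokens
        return self.norm(x), H, W
```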

Fig. 2 The architecture of an encoder stage (top), which includes a patch embedding, spatial-reduction attention, and feed-forward network, and the computational process of the SRA (bottom)

Each transformer layer consists of a Spatial-Reduction Attention (SRA) module and a Feed-Forward Network (FFN). LayerNorm (LN) is applied before the SRA and the FFN, and a residual connection is applied after each of them:

$$\begin{aligned} z^{l}= & \textrm{SRA}(LN(z^{l-1})) + z^{l-1}\end{aligned}$$
(1)
$$\begin{aligned} z^{l+1}= & \textrm{FFN}(LN(z^{l})) + z^{l} \end{aligned}$$
(2)

where \(z^{l}\) and \(z^{l+1}\) denote the output of the SRA and FFN layers, respectively.

The SRA reduces the cost of the attention operation by applying a convolution to the key K and value V to reduce their spatial scale.

$$\begin{aligned} \textrm{SRA}(q,k,v)=\textrm{Concat}(head_{0}, ..., head_{d})W^{0} \end{aligned}$$
(3)

where \(\textrm{Concat}(\cdot )\) denotes the concatenation operation and \(W^{0}\) is the linear projection parameter.

$$\begin{aligned} \textrm{head}_{i}=\textrm{Attention}(qW_{i}^{q}, \textrm{SR}(k)W_{i}^{k}, \textrm{SR}(v)W_{i}^{v}) \end{aligned}$$
(4)

where \(W^{q}_{i} \in \mathbb {R}^{C_{i} \times d_{head}}\), \(W^{k}_{i} \in \mathbb {R}^{C_{i} \times d_{head}}\), and \(W^{v}_{i} \in \mathbb {R}^{C_{i} \times d_{head}}\) are the linear projection parameters. \(\textrm{SR}(\cdot )\) is the operation for reducing the spatial dimension:

$$\begin{aligned} \textrm{SR}(x)=\textrm{Norm}(\textrm{Reshape}(x, \lambda _{i})W^{S}) \end{aligned}$$
(5)

where \(x \in \mathbb {R}^{(H_{i}W_{i})\times C_{i}}\) is the input, \(\lambda _{i}\) denotes the reduction ratio in Stage i, and \(W^{S} \in \mathbb {R}^{(\lambda _{i}^{2}C_{i}) \times C_{i}}\) is the linear projection.

The attention operation is carried out by

$$\begin{aligned} \textrm{Attention}(\textbf{q},\textbf{k},\textbf{v})= \textrm{Softmax}\left( \frac{\textbf{q}\textbf{k}^{\textsf{T}}}{\sqrt{d_{\textrm{head}}}}\right) \textbf{v}, \end{aligned}$$
(6)

where q, k, v represent the Query, Key, and Value, respectively. First, we compute the dot products of the query with all keys and divide each by \(\sqrt{d_{head}}\). The softmax function is then applied to obtain the attention weights. All these computations of the SRA are illustrated in Fig. 2 (bottom).
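
The following sketch implements Eqs. (3)-(6) for a single stage; the module name and the concrete layer choices (a strided convolution for SR(·) followed by LayerNorm) are assumptions consistent with the description above rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head attention in which K and V are spatially reduced (Eqs. 3-6)."""
    def __init__(self, dim, num_heads=2, reduction=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)            # W^0 in Eq. (3)
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)              # Norm in Eq. (5)

    def forward(self, x, H, W):                    # x: (B, H*W, C) tokens
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.d_head).transpose(1, 2)
        # SR(.): reshape tokens to a map, reduce spatially, flatten back (Eq. 5)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        k, v = self.kv(x_).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, self.d_head).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.d_head).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5   # Eq. (6)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```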

Fig. 3 The architecture of an HST transformer stage which includes temporal attention, spatial-reduction attention, and feed-forward network

Fig. 4 Illustration of how spatial and temporal attention are carried out in our HST transformer. The input consists of 4 frames, and each frame has \(4 \times 4\) patches. The query patch is shown in green. The blue patches are used to compute the temporal attention of the green patch, whereas the red patches are used to compute its spatial attention

The FFN comprises two linear layers, with a depth-wise convolution and a GELU (Gaussian Error Linear Unit) activation function inserted between them.
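
A minimal sketch of this convolutional FFN follows; the expansion ratio of 4 is an assumption.

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """Two linear layers with a depth-wise convolution and GELU in between,
    as described above; the expansion ratio is an assumption."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)   # depth-wise conv
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                 # x: (B, H*W, C) tokens
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        x = self.act(x)
        return self.fc2(x)
```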

3.4 Hierarchical spatio-temporal transformer

The computational complexity of full spatio-temporal attention is \(O(T^{2}S^{2})\), since self-attention is computed over all S spatial locations and T temporal locations:

$$\begin{aligned} z_{s,t}^{l} = \sum _{t^{'}=0}^{T-1} \sum _{s^{'}=0}^{S-1} \textrm{Softmax} \left( \frac{\textbf{q}^{l}_{s,t} \cdot \textbf{k}^{l}_{s^{'},t^{'}}}{\sqrt{d_{\textrm{head}}}} \right) \textbf{v}_{s^{'},t^{'}}^{l}, \left\{ \begin{matrix} s=0,...,S-1 \\ t=0,...,T-1 \end{matrix} \right\} \end{aligned}$$
(7)

where Softmax denotes the softmax function.

In order to reduce its computational complexity, ViViT [34] and TimeSformer [35] divided full spatio-temporal attention into temporal attention and spatial attention separately. Temporal attention and spatial attention are given as follows:

$$\begin{aligned} \begin{aligned} \tilde{z}_{s,t}^{l} = \sum _{t^{'}=0}^{T-1} \textrm{Softmax} \left( \frac{\textbf{q}^{l}_{s,t} \cdot \textbf{k}^{l}_{s,t^{'}}}{\sqrt{d_{\textrm{head}}}} \right) \textbf{v}^{l}_{s,t^{'}}, \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \\ z_{s,t}^{l} = \sum _{s^{'}=0}^{S-1} \textrm{Softmax} \left( \frac{\tilde{\textbf{q}}^{l}_{s,t} \cdot \tilde{\textbf{k}}^{l}_{s^{'},t}}{\sqrt{d_{\textrm{head}}}} \right) \tilde{\textbf{v}}^{l}_{s^{'},t}, \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \\ \end{aligned} \end{aligned}$$
(8)

where \(\tilde{\textbf{q}}^{l}_{s,t}\), \(\tilde{\textbf{k}}^{l}_{s^{'},t}\), \(\tilde{\textbf{v}}^{l}_{s^{'},t}\) are computed from \(\tilde{z}_{s,t}^{l}\). The divided Space-Time Attention reduces the complexity from \(O(T^{2}S^{2})\) to \(O(T^{2}S + TS^{2})\).

We introduce an HST transformer to capture both spatial and temporal information from the feature maps extracted by the encoder. To deal with the extracted multi-scale feature maps, our HST transformer contains four stages that share a similar architecture, as illustrated in Fig. 1 (middle part). Each stage consists of \(L_{i}\) transformer layers and outputs feature maps at a different scale. The resulting feature maps span four scales, which is convenient for the upstream decoder when reconstructing the future frame.

Different from ViViT and TimeSformer, we remove the fixed-size positional embedding that encodes the spatial and temporal position of each patch. As illustrated in Fig. 3, each stage of our HST transformer contains three components: temporal attention, spatial-reduction attention, and an FFN. A residual connection is applied after each of these modules. Note that temporal attention and spatial attention are computed in each layer to extract both spatial and temporal features effectively from the input. Here, temporal attention is obtained by computing the temporal correlation between patches located at identical spatial locations in different frames, as illustrated in Fig. 4. The output of the temporal attention is then used as input to the spatial-reduction attention. To reduce the computational cost, a spatial-reduction layer [29, 30] shrinks the spatial scale of the key K and value V before they are fed to the spatial attention module, which greatly reduces the cost of spatial self-attention. The equations for temporal attention and spatial attention in our HST transformer are given as follows:

$$\begin{aligned} \tilde{z}_{s,t}^{l}= & \sum _{t^{'}=0}^{T-1} \textrm{Softmax} \left( \frac{\textbf{q}^{l}_{s,t} \cdot \textbf{k}^{l}_{s,t^{'}}}{\sqrt{d_{\textrm{head}}}} \right) \textbf{v}^{l}_{s,t^{'}}, \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \nonumber \\ z_{s,t}^{l}= & \sum _{s^{'}=0}^{S-1} \textrm{Softmax} \left( \frac{\tilde{\textbf{q}}^{l}_{s,t} \cdot R(\tilde{\textbf{k}}^{l}_{s^{'},t})}{\sqrt{d_{\textrm{head}}}} \right) R(\tilde{\textbf{v}}^{l}_{s^{'},t}), \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \nonumber \\ \end{aligned}$$
(9)

where \(\tilde{\textbf{q}}^{l}_{s,t}\), \(\tilde{\textbf{k}}^{l}_{s^{'},t}\) and \(\tilde{\textbf{v}}^{l}_{s^{'},t}\) are obtained from \(\tilde{z}_{s,t}^{l}\). Note that \(R(\tilde{\textbf{k}}^{l}_{s^{'},t})\) and \(R(\tilde{\textbf{v}}^{l}_{s^{'},t})\) denote the reduced forms using a spatial reduction layer.
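
The factorized attention of Eq. (9) can be sketched as follows. For readability this sketch is single-head and reuses a spatial-reduction attention module such as the one sketched in Section 3.3; the tensor layout (B, T, S, C) and the module names are assumptions.

```python
import torch
import torch.nn as nn

class HSTAttention(nn.Module):
    """Eq. (9): temporal attention over patches at the same spatial location,
    followed by spatial-reduction attention within each frame (single-head)."""
    def __init__(self, dim, spatial_attn):
        super().__init__()
        self.qkv_t = nn.Linear(dim, dim * 3)    # temporal q, k, v
        self.spatial_attn = spatial_attn        # e.g. SpatialReductionAttention

    def forward(self, x, H, W):                 # x: (B, T, S, C) with S = H*W
        B, T, S, C = x.shape
        # temporal attention: attend across the T frames at each location s
        q, k, v = self.qkv_t(x).chunk(3, dim=-1)
        q = q.permute(0, 2, 1, 3)               # (B, S, T, C)
        k = k.permute(0, 2, 1, 3)
        v = v.permute(0, 2, 1, 3)
        attn = (q @ k.transpose(-2, -1)) / C ** 0.5   # single head: d_head = C
        z_t = (attn.softmax(dim=-1) @ v).permute(0, 2, 1, 3)   # (B, T, S, C)
        # spatial-reduction attention within each frame
        z = self.spatial_attn(z_t.reshape(B * T, S, C), H, W)
        return z.reshape(B, T, S, C)
```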

The resulting vector \(z_{s,t}^{l}\) is passed to the FFN to obtain the final encoding. A fully connected (FC) layer is applied to increase the feature channels of the input. Next, GELU is used as the activation function after the first FC layer. To exploit the local context, we add a depth-wise convolutional block to the FFN following the recent works [30, 36, 37]. Finally, the second FC decreases the channels to match the dimension of the input channels:

$$\begin{aligned} z_{s,t}^{l} = \textrm{FFN} ( \textrm{LN} (z_{s,t}^{l}) ) + z_{s,t}^{l} \end{aligned}$$
(10)

3.5 Decoder

A convolutional decoder reconstructs the future frame from the multi-scale feature maps. After the spatial and temporal information has been aggregated, the multi-scale feature maps generated by the four stages of the HST transformer are passed to the decoder via skip connections to restore the spatial resolution of the feature maps. The feature map of the previous stage is upsampled by a deconvolutional layer, and the resized feature map is concatenated with the feature map from the corresponding HST transformer stage, as shown in Fig. 1. The combined feature maps are passed through h convolutional layers, each consisting of a 3 \(\times \) 3 convolution, batch normalization, and a ReLU activation function.
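
A single decoder stage can be sketched as below; the choice of a 2× transposed convolution for upsampling and h = 2 conv-BN-ReLU layers are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Upsample the previous feature map, concatenate the skip feature from
    the corresponding HST stage, then apply h conv-BN-ReLU layers."""
    def __init__(self, in_ch, skip_ch, out_ch, h=2):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        layers, ch = [], out_ch + skip_ch
        for _ in range(h):
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            ch = out_ch
        self.conv = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = self.up(x)                          # double the spatial resolution
        x = torch.cat([x, skip], dim=1)         # fuse with HST feature map
        return self.conv(x)
```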

3.6 Objective function

Given a sequence of input frames \(x_{t-k:t}\), the network aims to predict the future frame \(\hat{x}_{t+1}\) corresponding to the ground-truth frame \(x_{t+1}\). To constrain the predicted frame \(\hat{x}_{t+1}\) to be similar to its ground truth \(x_{t+1}\), three different loss functions are used.

In order to guarantee the similarity of all pixels in RGB space, the \(l_{2}\) loss is applied between the predicted frame \(\hat{x}_{t+1}\) and the actual future frame \(x_{t+1}\) as follows:

$$\begin{aligned} L_{int}(x,\hat{x})=\Vert {x-\hat{x}}\Vert _{2}^{2} \end{aligned}$$
(11)

However, the \(l_{2}\) loss tends to produce a blurred output, so a gradient loss is added to obtain a sharper predicted frame. It computes the difference between the absolute gradients of the predicted frame and its ground truth along the two spatial dimensions:

$$\begin{aligned} L_{gra}(x,\hat{x})=\sum _{i,j}&\Big \Vert {\left| {\hat{x}_{i,j}-\hat{x}_{i-1,j}}\right| -\left| {x_{i,j}-x_{i-1,j}}\right| }\Big \Vert _{1} \nonumber \\ +&\Big \Vert {\left| {\hat{x}_{i,j}-\hat{x}_{i,j-1}}\right| -\left| {x_{i,j}-x_{i,j-1}}\right| }\Big \Vert _{1} \end{aligned}$$
(12)

Following the works [38, 39], Multi-Scale Structural Similarity (MS-SSIM) [40] is used to measure the structural difference. MS-SSIM was proposed to estimate the structural similarity of images at different resolutions.

The final loss is the combination of the intensity loss \(L_{int}\), gradient loss \(L_{gra}\) and multi-scale structural similarity loss \(L_{mss}\) as follows:

$$\begin{aligned} L_{pre}(x,\hat{x})=\alpha L_{int}(x,\hat{x}) + \beta L_{gra}(x,\hat{x}) + \gamma L_{mss}(x,\hat{x}), \end{aligned}$$
(13)

where \(\alpha \), \(\beta \), and \(\gamma \) are three coefficients that balance the weights of the loss functions.
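
A minimal sketch of the combined objective of Eqs. (11)-(13) is shown below; the losses are averaged over pixels here, and `ms_ssim_fn` is a placeholder for any MS-SSIM similarity implementation, both of which are assumptions.

```python
import torch

def intensity_loss(x, x_hat):
    """Intensity (l2) loss of Eq. (11), averaged over pixels here."""
    return torch.mean((x - x_hat) ** 2)

def gradient_loss(x, x_hat):
    """Gradient loss of Eq. (12): difference of absolute spatial gradients."""
    gx_h = (x[..., 1:, :] - x[..., :-1, :]).abs()
    gp_h = (x_hat[..., 1:, :] - x_hat[..., :-1, :]).abs()
    gx_w = (x[..., :, 1:] - x[..., :, :-1]).abs()
    gp_w = (x_hat[..., :, 1:] - x_hat[..., :, :-1]).abs()
    return torch.mean((gp_h - gx_h).abs()) + torch.mean((gp_w - gx_w).abs())

def prediction_loss(x, x_hat, ms_ssim_fn, alpha=1.0, beta=1.0, gamma=1.0):
    """Combined objective of Eq. (13); ms_ssim_fn returns a similarity in [0, 1]
    and is converted to a loss as 1 - similarity (an assumption)."""
    l_mss = 1.0 - ms_ssim_fn(x_hat, x)
    return (alpha * intensity_loss(x, x_hat)
            + beta * gradient_loss(x, x_hat)
            + gamma * l_mss)
```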

3.7 Anomaly detection

Given that the network has been trained using only normal event data in an unsupervised manner, it predicts normal events very well. However, it stumbles when an unseen (i.e., abnormal) event is given as input. We exploit this phenomenon by measuring the difference between the predicted frame and its ground truth.

Peak Signal to Noise Ratio (PSNR) is widely used to estimate image quality; a higher PSNR indicates a higher-quality frame. In other words, when a predicted frame has a high PSNR, its difference from the ground-truth frame is small, and we assume it corresponds to a normal event. PSNR is computed as follows:

$$\begin{aligned} \textrm{PSNR}(x, \hat{x})=10\log _{10} \frac{[\max _{\hat{x}}]^{2}}{\frac{1}{N}\sum _{i=1}^{N}(x_{i}-\hat{x}_{i})^{2}} \end{aligned}$$
(14)

where N denotes the number of pixels in the frame, \([\max _{\hat{x}}]\) is the maximum value of \(\hat{x}\).

Following the work [3], the PSNR values of all frames in a testing video are normalized to the range [0, 1], then the anomaly score S(t) of the t-th frame is obtained by using the following formula:

$$\begin{aligned} S(t)=\frac{\textrm{PSNR}_{t}-\min (\textrm{PSNR})}{\max (\textrm{PSNR})-\min (\textrm{PSNR})} \end{aligned}$$
(15)

where \(\min (\textrm{PSNR})\) and \(\max (\textrm{PSNR})\) denote the minimum and the maximum PSNR values in the given video sequence, respectively. The anomaly scores S(t) are used to determine whether a frame is normal or not using a threshold value.
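
Equations (14) and (15) can be sketched as follows; the per-video normalization assumes that all PSNR values of one test video are collected first.

```python
import torch

def psnr(x, x_hat):
    """PSNR of Eq. (14); the peak value is taken as the maximum of x_hat."""
    mse = torch.mean((x - x_hat) ** 2)
    return 10.0 * torch.log10(x_hat.max() ** 2 / mse)

def anomaly_scores(psnr_values):
    """Min-max normalization of Eq. (15) over all frames of one test video."""
    p = torch.as_tensor(psnr_values, dtype=torch.float32)
    return (p - p.min()) / (p.max() - p.min())
```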

4 Experiments

4.1 Video anomaly detection datasets

The proposed network was evaluated extensively using four benchmark datasets that are divided into two categories: ground-based videos (UCSD Pedestrian dataset [41], CUHK Avenue dataset [42] and ShanghaiTech Campus dataset [43]) and aerial videos (Drone-Anomaly dataset [10]).

Table 2 The numbers of training, testing, and total frames in the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, together with the numbers of normal and abnormal testing frames
Table 3 The numbers of training, testing, and total frames, and the numbers of normal and abnormal frames, for each scene of the Drone-Anomaly dataset

4.1.1 Ground-based Videos

UCSD Pedestrian dataset

The UCSD Pedestrian dataset contains two subsets, Ped1 and Ped2, which were captured in two different outdoor areas. Ped1 has a resolution of 158 \(\times \) 238 pixels, while Ped2 has a higher resolution of 240 \(\times \) 360 pixels. Following recent works [19, 22, 27], Ped1 is excluded from our experiments because of its low resolution. Ped2 contains 16 videos for training and 12 videos for testing, corresponding to 2550 training frames and 2010 testing frames, respectively. Ped2 contains 12 abnormal events, which include bikers, skaters, small carts, people in wheelchairs, and people walking across a walkway or the grass.

CUHK Avenue dataset

The CUHK Avenue dataset consists of 16 videos for training and 21 videos for testing. It contains a total of 30,652 frames which are split into 15,328 training frames and 15,423 testing frames. The resolution of each frame is 360 \(\times \) 640 pixels. The dataset contains 47 abnormal events such as throwing objects, loitering, and running.

ShanghaiTech Campus dataset

The ShanghaiTech Campus dataset has been one of the most challenging datasets for video anomaly detection. The dataset was recorded in 13 different scenes with different light conditions and camera angles. It contains 437 videos and is split into 330 videos for training and 107 videos for testing. The training set contains 274,515 frames, which include normal events while the testing set contains 42,883 frames with 130 abnormal events. Each video has a resolution of 480 \(\times \) 856 pixels.

The numbers of training, testing, and total frames for the Ped2, Avenue, and ShanghaiTech datasets are listed in the 2nd, 3rd, and 4th columns of Table 2, respectively. In addition, the numbers of normal and abnormal frames used in the testing phase are listed in the 5th and 6th columns, respectively.

4.1.2 Aerial videos

Drone-Anomaly dataset

The Drone-Anomaly dataset has been released recently [10], consisting of seven different scenes collected by drones flying over the highway, crossroad, bike roundabout, vehicle roundabout, railway, solar panels, and farmland, respectively. The whole dataset includes 37 training videos and 22 testing videos, corresponding to 51,635 training frames and 35,853 testing frames, respectively. Each frame has a resolution of 640 \(\times \) 640 pixels. This is a challenging dataset since the videos are collected in different places and contain various kinds of anomalous events even in the same scene. Moreover, since these aerial videos are collected while the drone is flying, they have both moving backgrounds and objects, making it hard to detect anomalies.

Since Drone-Anomaly consists of seven categorized scenes and each scene has a different number of frames, the counts of training, testing, and total frames for each scene are given in the 2nd, 3rd, and 4th columns of Table 3, respectively. The numbers of normal and abnormal frames for each scene are listed in the 5th and 6th columns, respectively.

4.2 Implementation details

All video frames of the three ground-based datasets as well as the aerial dataset are resized to 256 \(\times \) 256 and normalized to the range [-1, 1]. A sequence of four frames is fed to the model to predict the fifth frame. The AdamW optimizer is used to train the model with cosine learning rate decay. The initial learning rate is set to \(5e-4\) for the UCSD Ped2, CUHK Avenue, and Drone-Anomaly datasets and \(4e-4\) for the ShanghaiTech dataset. PVTv2-B1 [30] is used as the feature extractor for the UCSD Ped2 and Drone-Anomaly datasets, while the CUHK Avenue and ShanghaiTech datasets use PVTv2-B2 [30] to extract feature maps. Our HSTforU model with PVTv2-B1 and PVTv2-B2 has around 137 million and 149 million parameters, respectively.
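
A minimal sketch of the optimizer and cosine-decay schedule described above is given below; the number of epochs, the weight decay, and the placeholder model are assumptions.

```python
import torch

model = torch.nn.Linear(8, 8)     # placeholder standing in for the HSTforU model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):          # number of epochs is an assumption
    # ... iterate over training clips, compute L_pre, backprop, optimizer.step() ...
    scheduler.step()              # cosine learning-rate decay per epoch
```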

Fig. 5 The ROC curves of the proposed network with and without the HST transformer on three benchmark video datasets: UCSD Ped2, CUHK Avenue, and ShanghaiTech. Note that when PVT is combined with the HST transformer, the network performs best, whereas Swin without the HST transformer performs poorly

4.3 Evaluation metric

Following previous studies [3, 27], the frame-level Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is used as the evaluation metric. A higher AUC means that the given model performs better. An ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), which are defined as follows:

$$\begin{aligned} \textrm{TPR}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}, \end{aligned}$$
(16)
$$\begin{aligned} \textrm{FPR}=\frac{\textrm{FP}}{\textrm{FP}+\textrm{TN}}, \end{aligned}$$
(17)

where \(\textrm{TP}\), \(\textrm{FN}\), \(\textrm{FP}\), and \(\textrm{TN}\) denote true positive, false negative, false positive, and true negative, respectively.
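
Given frame-level ground-truth labels and anomaly scores, the frame-level AUC can be computed as in the sketch below; the label convention (1 = abnormal) and the toy arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

labels = np.array([0, 0, 1, 1, 0, 1])              # hypothetical frame labels (1 = abnormal)
scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9])  # hypothetical per-frame anomaly scores
fpr, tpr, _ = roc_curve(labels, scores)            # TPR/FPR of Eqs. (16)-(17)
auc = roc_auc_score(labels, scores)                # frame-level AUC
print(f"frame-level AUC = {auc:.3f}")
```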

The resulting ROC curves are shown in Figs. 5 and 6 and the AUC results are compared in Tables 9 and 10.

Fig. 6 The ROC curves of our model for the seven aerial scenes in the Drone-Anomaly dataset. Note that performance varies widely across the scenes, from 0.535 to 0.825, indicating that each scene has unique spatio-temporal characteristics

4.4 Ablation study

4.4.1 Ground-based Video

In this section, an ablation study is conducted to investigate: 1) which vision transformer (PVT vs Swin) performs better as the encoder in our HSTforU; 2) the optimal number of HST transformer layers; and 3) the impact of the proposed HST transformer. For this ablation study, the three standard benchmark datasets are used to evaluate the performance.

Vision Transformer as an encoder (PVT vs Swin)

The performance of our network is evaluated with two representative vision transformers: PVT v2 [30] and Swin Transformer v2 [32]. PVT is a well-known vision transformer with a hierarchical structure. Given that the encoder of the standard U-net has a similar structure, we reasoned that the convolution-based encoder could be converted smoothly into a vision-transformer-based one. Similarly, Swin is a vision transformer that produces multi-scale feature maps. The comparison is summarized in Table 4, suggesting that PVT v2 performs better than Swin on the three benchmark datasets under the present setting.

Optimal number of Hierarchical Spatio-Temporal (HST) Transformer layers

In this ablation study, we vary the number of layers of the HST transformer from 1 to 3 and measure the effect on the performance for the three datasets. All stages of the HST transformer have the same number of layers. As shown in Table 5, our HSTforU performs better when the number of layers is increased from 1 to 2, whereas the performance decreases when the number of layers is increased to 3, indicating that the network performs best with two layers.

Impact of HST Transformer

To evaluate the importance of the HST transformer, this section compares the performance of our network with and without it. The result, reported in Table 6, clearly indicates that our HSTforU performs better on all three benchmark datasets when the HST transformer is employed. Figure 5 compares the ROC curves of the proposed network using two encoders (PVT v2 and Swin v2) without the HST transformer, as well as PVT v2 combined with the HST transformer, on the three benchmark datasets. The blue and red lines denote the ROC curves that use PVT v2 as the encoder with and without the HST transformer, respectively, while the black dotted lines indicate the ROC curves for Swin v2 as the encoder without the HST transformer. The results suggest that PVT v2 + HST transformer performs best.

Table 4 Performance comparison of the proposed network for two different feature extractions (PVT and Swin) in terms of AUC (%)
Table 5 Performance comparison of the proposed network for different numbers of HST Transformer layers in terms of AUC (%)
Table 6 Performance comparison of the proposed network with and without HST Transformer in terms of AUC (%)
Table 7 Performance comparison of the proposed network for different numbers of HST Transformer layers on the Drone-Anomaly dataset in terms of AUC (%)

4.4.2 Aerial video

In this section, similar effects are evaluated for the proposed model on the Drone-Anomaly dataset.

Optimal number of HST Transformer layers

As described in Section 3.4, the HST transformer includes four stages, each consisting of \(L_{i}\) layers. To validate the effect of varying the number of HST transformer layers, we report the performance of HSTforU with the number of layers set to \(L_{i}=1\), 2, and 3.

As shown in Table 7, the proposed network with one HST transformer layer achieves the best performance in most scenes. When the HST transformer is set to two layers, the performance increases slightly on the Highway and Bike roundabout scenes. However, when the number of layers is increased to three, no further improvement is seen except on the Farmland scene.

Impact of HST Transformer

Table 8 shows the performance of the proposed network with and without the HST transformer. When the HST transformer is used, the performance is much higher than without it, suggesting that our HST transformer plays an essential role in detecting anomalies on the Drone-Anomaly dataset.

Table 8 Performance comparison of the proposed network with and without HST Transformer on Drone-Anomaly dataset in terms of AUC (%)

4.5 Comparison with state-of-the-art methods

4.5.1 Ground-based videos

In this section, we compare the proposed method with state-of-the-art methods on the three benchmark video datasets, as shown in Table 9. The experimental results indicate that our model achieves the best AUC on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets compared with other recent methods. The AUC on the Ped2 dataset exceeds that of the second-best method [54] by only 0.1%, but our method improves more substantially on the Avenue and ShanghaiTech datasets, reaching 87.8% and 75.3%, gains of 1.0% and 1.3%, respectively. Cai et al. [16] obtained 87.4% on the Avenue dataset, compared to our 87.8% on the same dataset; however, our method obtained 75.3% on the ShanghaiTech dataset, whereas theirs reached only 74.2%. Finally, [20] obtained 74.5% on the ShanghaiTech dataset, but its AUC on the Ped2 and Avenue datasets was 96.0% and 86.3%, compared with our 97.3% and 87.8%. These results suggest that our model outperforms convolution-based U-nets as well as TransAnomaly on the ground-based video datasets.

Table 9 Performance comparison of the proposed network with state-of-the-art methods in terms of AUC (%) on ground-based video datasets
Table 10 Performance comparison of the proposed network with baseline and Skip-Ganomaly methods in terms of AUC (%) on the Drone-Anomaly dataset
Fig. 7 Visualization of the prediction error with and without the HST transformer on the Ped2, Avenue, and ShanghaiTech datasets, from top to bottom. Since the prediction errors with the HST transformer are brighter than those without it, the difference between the two can be seen clearly around the abnormal objects, which are circled in red for convenience. Note that w/o HST denotes “without HST” and w/ HST denotes “with HST”. Each difference map is obtained by subtracting one from the other and is scaled from dark blue to yellow

4.5.2 Aerial videos

Our model is compared with ANDT [10], a recent model for aerial videos, on the seven scenes of the Drone-Anomaly dataset, as shown in Table 10. Our model performs better than ANDT in five of the seven scenes. For instance, our model outperforms ANDT by 1.9%, 0.5%, 4.3%, 8.4%, and 3.5% in AUC on the Highway, Bike roundabout, Vehicle roundabout, Railway inspection, and Solar panel inspection scenes, respectively.

Figure 6 shows the ROC curves for the seven scenes of the Drone-Anomaly dataset. Our model achieves the best performance on the Bike roundabout (green line), whereas the lowest result is on the Crossroads scene (red line). Given that performance varies widely across the scenes, from 0.535 to 0.825, each scene of the Drone-Anomaly dataset appears to have unique spatio-temporal characteristics; this suggests that our model extracts spatio-temporal features well overall, yet may need fine-tuning for particular aerial scenes.

Fig. 8 A normal frame (left), an abnormal frame (middle), and the prediction error of the abnormal frame (right) are shown in each panel, respectively, for three benchmark video anomaly detection datasets. The prediction error is scaled from blue to yellow, up to red

4.6 Qualitative comparison with visualization of prediction error

Although we have seen quantitatively that the HST transformer enhances the overall performance of the system, it is also helpful to visualize this enhancement. Figure 7 collects three abnormal events from three different datasets, namely Ped2, Avenue, and ShanghaiTech, from top to bottom. In the top row, the prediction errors are visualized with and without the HST, and the difference map contains two abnormal regions, a vehicle and a cyclist, circled in red. In the middle row, a man throwing a bag exhibits abnormal behavior, and the prediction error around him is distinctive; the difference between with and without the HST can be seen clearly in the difference map as well. In the bottom row, a riding motorcyclist is identified as abnormal in the prediction error maps, and the difference between with and without the HST is illustrated in the difference map.

Fig. 9 The normal frames (left), the abnormal frames (middle), and the prediction errors of the abnormal frames (right) are shown, respectively, for the seven scenes in the Drone-Anomaly dataset

Fig. 10 Anomaly scores for three representative ground-based video datasets, i.e. UCSD Ped2, CUHK Avenue, and ShanghaiTech. In each panel, two anomaly scores are drawn: the blue line denotes the score with the HST transformer, whereas the dotted line denotes the score without it. In each panel, abnormal regions are shaded pink and normal regions are white

Fig. 11 Anomaly scores for three aerial scenes in the Drone-Anomaly dataset, i.e. highway, bike roundabout, and solar panel. In each panel, two anomaly scores are drawn: the blue line denotes the score with the HST transformer, whereas the green line denotes the score without it. In each panel, abnormal regions are shaded pink and normal regions are white

Fig. 12 Two examples where our network fails to detect an anomaly. (a) A man occupies just a few pixels within the image. (b) A skateboarder maintains a pose for a few seconds

4.7 Visualization

4.7.1 Prediction error

Figure 8 visualizes how our model responds to the ground-based video datasets. In each panel, the left image shows the normal case, the middle depicts the anomaly, and the right represents the prediction error, scaled from blue through yellow up to red. In the top row, the two abnormal objects are a car and a cyclist against a background of pedestrians. In the middle row, a man is throwing a bag. In the bottom row, a motorcyclist is riding among a few pedestrians.

For the aerial video dataset, Fig. 9 visualizes how our model detects anomalies in the seven different scenes. In each panel, the left image shows the normal case, the middle depicts the anomaly, and the right represents the prediction error. Because this is a challenging dataset, as mentioned before, it is helpful to describe the context of each scene in terms of its abnormality: in the highway scene, a group of cattle intrudes on the highway; at the crossroad, a man crosses the road illegally; a big bus intrudes on the bike roundabout; a man crosses the road illegally at the vehicle roundabout; an unidentified object is found on the railway; one of the solar panels has collapsed and now appears grey; and a bicycle is found in the middle of the farmland. See also our demo video at https://vt-le.github.io/HSTforU/.

4.7.2 Anomaly score

Figure 10 illustrates the anomaly scores of three example videos from the ground-based datasets. In each panel, the blue line denotes the score of our model with the HST transformer, whereas the green line indicates the score without it. For the normal regions, the score of our model with the HST transformer is often lower than that without it. For the abnormal regions (shaded pink), most of the blue curve lies above the green one, e.g. in (c), indicating that our network with the HST transformer discriminates normal and abnormal events better than without it.

The anomaly scores of three videos corresponding to three scenes of the Drone-Anomaly dataset are shown in Fig. 11. Similar observations can be made in these panels, particularly in Fig. 11(a): during an abnormal event, the anomaly score of our model with the HST transformer is often higher than that without it.

4.8 Limitations

Although our network achieves state-of-the-art performance, there are several challenging cases in which it fails to detect abnormal events. First, when the distance between the camera and the object is very large, as shown in Fig. 12(a), where a skateboarder appears in the top-right corner of the frame and occupies only a few pixels, our network produces a very small prediction error for this object. Second, Fig. 12(b) shows another case where a skateboarder stands at a considerable distance from the camera and keeps the same pose across a sequence of consecutive frames. As a result, the network produces a small prediction error around him.

The primary cause of these limitations appears to be the angle and distance between the camera and the objects. An immediate remedy is to increase the resolution of the video frames, but ultimately it will be necessary to develop a model that can capture different viewpoints of the same scene or understand the more sophisticated context within the scene.

5 Conclusion

Although the primary target of VAD was restricted to ground-based datasets until recently, aerial datasets have grown rapidly as drones have become widely available and have found many interesting applications. In this study, we propose a transformer-based VAD method with which one can investigate both aerial and ground-based anomaly datasets in a unified framework. The U-net-based prediction network is trained on normal events under an unsupervised learning paradigm and is asked to discriminate abnormal events from normal ones during testing. The encoder of our U-net is a multi-stage transformer that generates multi-scale feature maps by extracting visual features from the input video in a coarse-to-fine manner. The merit of this architecture is its flexibility, since any hierarchical transformer, such as a pyramid vision transformer or Swin, can be plugged in depending on the designated application. To handle the aerial datasets, which contain diverse dynamic backgrounds and moving objects, we propose a hierarchical spatio-temporal transformer that extracts both temporal and spatial features effectively, since it is designed to compute spatial and temporal features jointly using a new attention mechanism. Evaluation results on the three benchmark ground-based anomaly detection datasets were 97.3% on Ped2, 87.8% on Avenue, and 74.9% on ShanghaiTech, and those on the Drone-Anomaly dataset varied from 53.5% to 82.5% across scenes, suggesting that our model outperforms state-of-the-art methods and can detect anomalies in both aerial and ground-based videos with a single high-performing model.