1 Introduction

Detecting an abnormal event among normal events in a video may be straightforward for an expert human observer; for a computer, however, it is far from easy. Video Anomaly Detection (VAD) remains one of the challenging problems in computer vision. Recently, there has also been a need for a unified video anomaly detection framework that can handle both ground-based and aerial video. The difficulty of building a high-performing VAD system can be summarized in two aspects. First, abnormal events occur only sporadically compared to normal activities, so collecting sufficient abnormal cases is hard. Moreover, since abnormal events in the real world are complicated and diverse, it is difficult to define all possible anomaly patterns. One prominent way to handle this issue is to adopt unsupervised learning with a U-net: only normal events are used during training, and yet the network can distinguish between normal and abnormal events by measuring prediction errors during testing. Second, aerial videos are far more dynamic than ground-based videos because all footage is captured while the drone is flying. Since the network must handle these spatio-temporal aspects to achieve high performance, we propose a hierarchical spatio-temporal transformer embedded within a U-net.

Most anomaly detection methods are based on Convolutional Neural Networks (CNNs) and can be divided roughly into two categories: reconstruction-based [1, 2] and prediction-based approaches [3, 4]. Such networks typically consist of an encoder and a decoder: the former extracts features from input frames and the latter generates the corresponding output. A drawback of CNNs, however, is that their relatively small receptive fields make it difficult to associate two distant points in a given image. To handle this, some approaches utilize attention mechanisms to capture such long-range information within the video frames. For instance, channel attention has been applied to various components of convolutional networks, such as encoders [5], skip connections [6], or decoders [7], while spatial attention is commonly leveraged within the skip connections [5, 6]. On the other hand, [8] inserts a squeeze-and-excitation block as an attention layer between the encoder and decoder. However, the architectures in these studies still rely primarily on convolutions, with attention serving only as an auxiliary module.

Transformers have demonstrated strong performance in various computer vision tasks. In video anomaly detection, TransAnomaly [9] adds two separate transformer modules to a convolutional network to enhance temporal information. However, this approach captures temporal information only from the smallest-scale feature map, since these modules are attached after the last layer of the encoder. On the other hand, ANDT [10] utilizes a transformer as an encoder to extract features from input frames for anomaly detection in aerial videos. However, this transformer generates only a single-scale feature map and thus does not provide the coarse-to-fine feature maps the decoder needs to generate its output.

To address the limitations of such attention modules and columnar transformers, and drawing inspiration from the success of vision transformers, we propose a Hierarchical Spatio-Temporal Transformer for U-net (HSTforU) for anomaly detection in aerial and ground-based videos. Unlike previous transformer-based methods, we extract multi-scale feature maps from input video frames using a four-stage transformer backbone, with output resolutions of \(\frac{H}{4} \times \frac{W}{4} \times C_{1}\), \(\frac{H}{8} \times \frac{W}{8} \times C_{2}\), \(\frac{H}{16} \times \frac{W}{16} \times C_{3}\), and \(\frac{H}{32} \times \frac{W}{32} \times C_{4}\). To capture both appearance and motion information from these multi-scale feature maps, we propose a Hierarchical Spatio-Temporal (HST) transformer. Different from traditional video transformers, the HST transformer has a hierarchical architecture that encodes features at various scales, from low resolution to high resolution. Within the HST transformer, we introduce an efficient attention computation in which both temporal attention and spatial attention are computed in every transformer layer. Note that this spatio-temporal attention is computed in all n layers of each stage of the HST transformer.

Our main contributions are summarized as follows:

  • A transformer-based video anomaly detection framework is presented where one can investigate both aerial and ground-based anomaly datasets in a unified manner.

  • We propose a new HSTforU, which can generate multi-scale feature maps for reconstructing the output frame that requires dense prediction at the pixel level as well as for modeling long-range dependencies of video frames by capturing both spatial and temporal information.

  • Extensive evaluation, including ablation studies on an aerial and three benchmark ground-based anomaly datasets, suggests that our network achieves competitive performance compared to state-of-the-art methods.

The rest of this paper is organized as follows: The related studies about video anomaly detection are discussed in Section 2. The proposed hierarchical spatio-temporal transformer for video anomaly detection is presented in Section 3. In Section 4, experiments including an ablation study are conducted to validate the effectiveness of the proposed network. Finally, Section 5 concludes this study.

2 Related work

In this section, we discuss recent approaches that are closely related to autoencoder-based video anomaly detection and review several vision transformers relevant to our model. Most of these approaches build on deep convolutional neural networks, which have been employed in a wide range of applications. Under unsupervised learning, the dominant paradigm in VAD, there are roughly two representative families: reconstruction-based and prediction-based approaches.

2.1 Reconstruction-based approaches

The reconstruction-based approach typically utilizes an autoencoder to reconstruct input frames. The reconstruction error between input and output determines whether a frame depicts a normal or an abnormal event. Deepak et al. [2] proposed an autoencoder for video frame reconstruction; to exploit both spatial and temporal information from input frames, a ConvLSTM layer was added to the encoder and decoder, and a residual block was placed between the encoder and decoder to avoid the vanishing gradient problem. To address the false reconstruction problem of the autoencoder, [11] combined a human expert’s feedback with the convolutional autoencoder’s output, while [12] combined an autoencoder with an explanation method to enhance its interpretability. On the other hand, [12,13,14] proposed two-stream networks designed to extract both spatial and temporal features to improve the performance of their reconstruction networks. Similarly, [15] employed a two-stream network that has both a forward network for reconstructing the current frame and a backward network for generating the reversed frames.

2.2 Prediction-based approaches

The prediction-based approach predicts future frames from several preceding frames, assuming that abnormal events are unpredictable whereas normal events are predictable. U-Net is typically employed for this purpose. To enhance network performance, [3] incorporated a visual relation module into the U-Net, while [16] added a context module and a ConvGRU module to extract multi-scale features and to model temporal information. [8] embedded an attention layer and a memory addressing module between the encoder and decoder of the prediction network. To exploit both spatial and temporal features, a 3D U-Net was constructed for predicting future frames [4, 17, 18]. The two-stream network of [19] consists of an appearance stream for capturing spatial features and a motion stream for exploiting temporal features. For predicting future frames, [20] proposed a multi-timescale model in which each timescale has both a motion and an appearance stream. Two U-Nets [21, 22] were used to predict forward and backward frames. In contrast to most recent methods, which operate at the video frame level, [23] proposed a model that uses graph convolutional networks to predict skeleton behavior, extracting a set of joints for pose perception across consecutive frames.

2.3 Integration of reconstruction and prediction

To leverage the advantages of both reconstruction and prediction approaches, a reconstruction network has been combined with a prediction network in an end-to-end framework for video anomaly detection [24,25,26]. Chang et al. [27] introduced a two-stream network to reconstruct the first frame of the input as well as to predict the RGB difference from a sequence of consecutive frames.

2.4 Vision transformer

Since Vision Transformer (ViT) [28] achieved state-of-the-art performance in image classification, several transformer-based methods have been introduced to generate multi-scale feature maps for downstream tasks, such as Pyramid Vision Transformers (PVTs) [29, 30] and Swin Transformers [31, 32]. Building upon the success of transformers in vision tasks, TransAnomaly [9] applied a transformer to video anomaly detection by inserting a temporal transformer encoder and a spatial transformer encoder into an autoencoder, thereby enhancing temporal information and global contexts. Similarly, Sun et al. [33] incorporated a transformer module into a convolutional encoder to leverage its powerful modeling capabilities for capturing temporal features and global information. ANDT [10] combined a Vision Transformer [28] with a convolutional decoder to predict frames.

Table 1 Definitions of the symbols and process components used in this paper

3 Method

3.1 Problem formulation

VAD aims to identify whether the events in a video are normal or abnormal. Given that abnormal cases are rare and often difficult to define, as mentioned above, a neural network such as a U-net is trained on normal cases only, in an unsupervised manner. Since the network produces a larger error for abnormal events than for normal ones, the two can then be readily discriminated. A common way to realize this is to predict a future frame from the preceding frames: the predicted frame \(\hat{x}_{t+1}\) is generated from the k consecutive frames \(x_{t-k:t}\) and compared with the ground-truth frame \(x_{t+1}\) to compute the losses during training and the anomaly score during testing. In the present study, we propose a transformer-based U-net for high-performing VAD, applicable to both ground-based and aerial datasets. Before providing a detailed description of our network, we define the symbols for all variables and the process components within the system in Table 1.
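
As a concrete illustration of this formulation, the following minimal sketch builds (input clip, target frame) pairs from a video tensor; the helper name, the tensor layout, and the default value of k are illustrative assumptions rather than details from the paper.

```python
import torch

def make_clips(frames: torch.Tensor, k: int = 4):
    """Build (x_{t-k:t}, x_{t+1}) training pairs from a video tensor of shape
    (T, C, H, W). With k = 4, four consecutive frames predict the fifth."""
    clips, targets = [], []
    for t in range(k, frames.shape[0]):
        clips.append(frames[t - k:t])   # k consecutive input frames
        targets.append(frames[t])       # ground-truth future frame
    return torch.stack(clips), torch.stack(targets)
```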

Fig. 1 The overall architecture of the proposed HSTforU. The encoder utilizes a four-stage pyramid transformer \(\mathcal {E}_{1:4}\) to generate multi-scale feature maps, and each stage has a link to an HST transformer. The HST transformer also has four stages \(\mathcal {H}_{1:4}\) to handle the corresponding feature maps, and each stage computes spatio-temporal features jointly. The multi-scale feature maps extracted from this transformer are conveyed to a convolutional decoder

3.2 Overall architecture

A schematic diagram of our network is illustrated in Fig. 1. First, a sequence of input frames \(x_{t-k:t}\) is fed into a multi-stage transformer encoder \(\mathcal {E}\) to obtain multi-scale feature maps \(y = \mathcal {E}(x_{t-k:t})\). To better capture both spatial and temporal information, the extracted feature maps y are fed into an HST transformer \(\mathcal {H}\) to obtain \(z = {\mathcal {H}(y)}\). Note that the output \(y_{i}\) of each stage \(\mathcal {E}_{i}\) of the encoder \(\mathcal {E}\) serves as input for both the next stage \(\mathcal {E}_{i+1}\) and the corresponding stage \(\mathcal {H}_{j}\) of the HST transformer. In addition, a residual connection is applied after each stage \(\mathcal {H}_{j}\) of the HST transformer. Finally, the four-stage hierarchical representations z are fed to the decoder \(\mathcal {D}\) to generate the future frame \(\hat{x}_{t+1} = \mathcal {D}(z)\). The decoder is an upstream network, where the input to each layer consists of the upsampled feature map from the previous layer and the corresponding output \(z_{i}\) of the HST transformer.
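
This data flow can be summarized in the following schematic sketch; `enc_stages`, `hst_stages`, and `decoder` are placeholders for the modules described in Sections 3.3-3.5, and the exact placement of the residual connection and the tensor layouts are simplifying assumptions.

```python
import torch.nn as nn

class HSTforUSketch(nn.Module):
    """Schematic forward pass only: each encoder stage feeds both the next
    encoder stage and its corresponding HST stage; the decoder consumes the
    four-stage representations z to predict the future frame."""
    def __init__(self, enc_stages, hst_stages, decoder):
        super().__init__()
        self.enc_stages = nn.ModuleList(enc_stages)   # E_1 .. E_4
        self.hst_stages = nn.ModuleList(hst_stages)   # H_1 .. H_4
        self.decoder = decoder                        # upsampling decoder D

    def forward(self, x):                             # x: input clip x_{t-k:t}
        y, z = x, []
        for enc, hst in zip(self.enc_stages, self.hst_stages):
            y = enc(y)                 # multi-scale feature map y_i
            z.append(hst(y) + y)       # spatio-temporal features + residual
        return self.decoder(z)         # predicted frame x_hat_{t+1}
```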

3.3 Encoder

The encoder contains four stages, and each stage consists of a patch embedding and several transformer layers, as shown in Fig. 2 (top). These stages generate coarse-to-fine feature maps. A convolution with stride S is applied to an input frame of size \(H \times W \times 3\) to obtain overlapping embedded patches of size \(\frac{H}{S} \times \frac{W}{S} \times C_{1}\). The embedded patches are then passed through several transformer layers to obtain a feature map \(\textbf{x}_{1}\) of size \(\frac{H}{4} \times \frac{W}{4} \times C_{1}\). Similarly, the feature maps \(\textbf{x}_{2}\), \(\textbf{x}_{3}\), and \(\textbf{x}_{4}\) of stages two, three, and four are generated by feeding the output of each stage to the next, as illustrated in Fig. 1.
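
A minimal sketch of this overlapping patch embedding is given below; the kernel size and padding (chosen so that the output resolution is H/S × W/S) are assumptions in the spirit of PVT-style embeddings, not values taken from the paper.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Embed overlapping patches with a strided convolution.
    Kernel size 2*S-1 and padding S-1 are assumptions chosen so that the
    output resolution is H/S x W/S, matching the text."""
    def __init__(self, in_ch=3, embed_dim=64, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=2 * stride - 1,
                              stride=stride,
                              padding=stride - 1)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, C1, H/S, W/S)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)        # (B, H*W/S^2, C1) tokens
        return self.norm(x), H, W
```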

Fig. 2 The architecture of an encoder stage (top), which includes a patch embedding, spatial-reduction attention, and feed-forward network, and the computational process of the SRA (bottom)

Each transformer layer consists of a Spatial-Reduction Attention (SRA) module and a Feed-Forward Network (FFN). LayerNorm (LN) is applied before the SRA and the FFN, and a residual connection is applied after each of them:

$$\begin{aligned} z^{l}= & \textrm{SRA}(LN(z^{l-1})) + z^{l-1}\end{aligned}$$
(1)
$$\begin{aligned} z^{l+1}= & \textrm{FFN}(LN(z^{l})) + z^{l} \end{aligned}$$
(2)

where \(z^{l}\) and \(z^{l+1}\) denote the output of the SRA and FFN layers, respectively.

The SRA reduces the cost of the attention operation by applying a convolution to the key K and value V to reduce their spatial scale.

$$\begin{aligned} \textrm{SRA}(q,k,v)=\textrm{Concat}(head_{0}, ..., head_{d})W^{0} \end{aligned}$$
(3)

where \(\textrm{Concat}(\cdot )\) denotes the concatenation operation and \(W^{0}\) is the linear projection parameter.

$$\begin{aligned} \textrm{head}_{i}=\textrm{Attention}(qW_{i}^{q}, \textrm{SR}(k)W_{i}^{k}, \textrm{SR}(v)W_{i}^{v}) \end{aligned}$$
(4)

where \(W^{q}_{i} \in \mathbb {R}^{C_{i} \times d_{head}}\), \(W^{k}_{i} \in \mathbb {R}^{C_{i} \times d_{head}}\), and \(W^{v}_{i} \in \mathbb {R}^{C_{i} \times d_{head}}\) are the linear projection parameters. \(\textrm{SR}(\cdot )\) is the operation for reducing the spatial dimension:

$$\begin{aligned} \textrm{SR}(x)=\textrm{Norm}(\textrm{Reshape}(x, \lambda _{i})W^{S}) \end{aligned}$$
(5)

where \(x \in \mathbb {R}^{(H_{i}W_{i})\times C_{i}}\) is the input, \(\lambda _{i}\) denotes the reduction ratio in Stage i, and \(W^{S} \in \mathbb {R}^{(\lambda _{i}^{2}C_{i}) \times C_{i}}\) is the linear projection.

The attention operation is carried out by

$$\begin{aligned} \textrm{Attention}(\textbf{q},\textbf{k},\textbf{v})= \textrm{Softmax}\left( \frac{\textbf{q}\textbf{k}^{\textsf{T}}}{\sqrt{d_{\textrm{head}}}}\right) \textbf{v}, \end{aligned}$$
(6)

where q, k, v represent the Query, Key, and Value, respectively. First, we compute the dot products of the query with all keys and divide each by \(\sqrt{d_{head}}\). The softmax function is then applied to obtain the attention weights. All these computations of the SRA are illustrated in Fig. 2 (bottom).
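
The following sketch implements Eqs. (3)-(6) for a single stage; the module name and the concrete layer choices (a strided convolution for SR(·) followed by LayerNorm) are assumptions consistent with the description above rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head attention in which K and V are spatially reduced (Eqs. 3-6)."""
    def __init__(self, dim, num_heads=2, reduction=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)            # W^0 in Eq. (3)
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)              # Norm in Eq. (5)

    def forward(self, x, H, W):                    # x: (B, H*W, C) tokens
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.d_head).transpose(1, 2)
        # SR(.): reshape tokens to a map, reduce spatially, flatten back (Eq. 5)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        k, v = self.kv(x_).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, self.d_head).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.d_head).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5   # Eq. (6)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```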

Fig. 3 The architecture of an HST transformer stage which includes temporal attention, spatial-reduction attention, and feed-forward network

Fig. 4 Illustration of how spatial and temporal attention are carried out in our HST transformer. The input consists of 4 frames, and each frame has \(4 \times 4\) patches. The query patch is shown in green. The blue patches are used to compute the temporal attention of the green patch, whereas the red patches are used to compute its spatial attention

The FFN comprises two linear layers, with a depth-wise convolution and a GELU (Gaussian Error Linear Unit) activation function inserted between them.
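
A minimal sketch of this convolutional FFN follows; the expansion ratio of 4 is an assumption.

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """Two linear layers with a depth-wise convolution and GELU in between,
    as described above; the expansion ratio is an assumption."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)   # depth-wise conv
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                 # x: (B, H*W, C) tokens
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        x = self.act(x)
        return self.fc2(x)
```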

3.4 Hierarchical spatio-temporal transformer

The computational complexity of full spatio-temporal attention is \(O(T^{2}S^{2})\), since self-attention is computed over all S spatial locations and T temporal locations:

$$\begin{aligned} z_{s,t}^{l} = \sum _{t^{'}=0}^{T-1} \sum _{s^{'}=0}^{S-1} \textrm{Softmax} \left( \frac{\textbf{q}^{l}_{s,t} \cdot \textbf{k}^{l}_{s^{'},t^{'}}}{\sqrt{d_{\textrm{head}}}} \right) \textbf{v}_{s^{'},t^{'}}^{l}, \left\{ \begin{matrix} s=0,...,S-1 \\ t=0,...,T-1 \end{matrix} \right\} \end{aligned}$$
(7)

where Softmax denotes the softmax function.

In order to reduce its computational complexity, ViViT [34] and TimeSformer [35] divided full spatio-temporal attention into temporal attention and spatial attention separately. Temporal attention and spatial attention are given as follows:

$$\begin{aligned} \begin{aligned} \tilde{z}_{s,t}^{l} = \sum _{t^{'}=0}^{T-1} \textrm{Softmax} \left( \frac{\textbf{q}^{l}_{s,t} \cdot \textbf{k}^{l}_{s,t^{'}}}{\sqrt{d_{\textrm{head}}}} \right) \textbf{v}^{l}_{s,t^{'}}, \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \\ z_{s,t}^{l} = \sum _{s^{'}=0}^{S-1} \textrm{Softmax} \left( \frac{\tilde{\textbf{q}}^{l}_{s,t} \cdot \tilde{\textbf{k}}^{l}_{s^{'},t}}{\sqrt{d_{\textrm{head}}}} \right) \tilde{\textbf{v}}^{l}_{s^{'},t}, \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \\ \end{aligned} \end{aligned}$$
(8)

where \(\tilde{\textbf{q}}^{l}_{s,t}\), \(\tilde{\textbf{k}}^{l}_{s^{'},t}\), \(\tilde{\textbf{v}}^{l}_{s^{'},t}\) are computed from \(\tilde{z}_{s,t}^{l}\). The divided Space-Time Attention reduces the complexity from \(O(T^{2}S^{2})\) to \(O(T^{2}S + TS^{2})\).

We introduce an HST transformer to capture both spatial and temporal information from the feature maps extracted by the encoder. To deal with the extracted multi-scale feature maps, our HST transformer contains four stages that share a similar architecture, as illustrated in Fig. 1 (middle part). Each stage consists of \(L_{i}\) transformer layers and outputs feature maps at a different scale. The resulting feature maps span four scales, which is convenient for the upstream decoder when reconstructing the future frame.

Different from ViViT and TimeSformer, we remove the fixed-size positional embedding that encodes the spatial and temporal position of each patch. As illustrated in Fig. 3, each stage of our HST transformer contains three components: temporal attention, spatial-reduction attention, and an FFN. A residual connection is applied after each of these modules. Note that temporal attention and spatial attention are computed in each layer to extract both spatial and temporal features effectively from the input. Here, temporal attention is obtained by computing the temporal correlation between patches located at identical spatial locations in different frames, as illustrated in Fig. 4. The output of the temporal attention is then used as input to the spatial-reduction attention. To reduce the computational cost, a spatial-reduction layer [29, 30] shrinks the spatial scale of the key K and value V before they are fed to the spatial attention module, which greatly reduces the cost of spatial self-attention. The equations for temporal attention and spatial attention in our HST transformer are given as follows:

$$\begin{aligned} \tilde{z}_{s,t}^{l}= & \sum _{t^{'}=0}^{T-1} \textrm{Softmax} \left( \frac{\textbf{q}^{l}_{s,t} \cdot \textbf{k}^{l}_{s,t^{'}}}{\sqrt{d_{\textrm{head}}}} \right) \textbf{v}^{l}_{s,t^{'}}, \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \nonumber \\ z_{s,t}^{l}= & \sum _{s^{'}=0}^{S-1} \textrm{Softmax} \left( \frac{\tilde{\textbf{q}}^{l}_{s,t} \cdot R(\tilde{\textbf{k}}^{l}_{s^{'},t})}{\sqrt{d_{\textrm{head}}}} \right) R(\tilde{\textbf{v}}^{l}_{s^{'},t}), \left\{ \begin{matrix} s=0,...,S-1\\ t=0,...,T-1 \end{matrix} \right\} \nonumber \\ \end{aligned}$$
(9)

where \(\tilde{\textbf{q}}^{l}_{s,t}\), \(\tilde{\textbf{k}}^{l}_{s^{'},t}\) and \(\tilde{\textbf{v}}^{l}_{s^{'},t}\) are obtained from \(\tilde{z}_{s,t}^{l}\). Note that \(R(\tilde{\textbf{k}}^{l}_{s^{'},t})\) and \(R(\tilde{\textbf{v}}^{l}_{s^{'},t})\) denote the reduced forms using a spatial reduction layer.
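
The factorized attention of Eq. (9) can be sketched as follows. For readability this sketch is single-head and reuses a spatial-reduction attention module such as the one sketched in Section 3.3; the tensor layout (B, T, S, C) and the module names are assumptions.

```python
import torch
import torch.nn as nn

class HSTAttention(nn.Module):
    """Eq. (9): temporal attention over patches at the same spatial location,
    followed by spatial-reduction attention within each frame (single-head)."""
    def __init__(self, dim, spatial_attn):
        super().__init__()
        self.qkv_t = nn.Linear(dim, dim * 3)    # temporal q, k, v
        self.spatial_attn = spatial_attn        # e.g. SpatialReductionAttention

    def forward(self, x, H, W):                 # x: (B, T, S, C) with S = H*W
        B, T, S, C = x.shape
        # temporal attention: attend across the T frames at each location s
        q, k, v = self.qkv_t(x).chunk(3, dim=-1)
        q = q.permute(0, 2, 1, 3)               # (B, S, T, C)
        k = k.permute(0, 2, 1, 3)
        v = v.permute(0, 2, 1, 3)
        attn = (q @ k.transpose(-2, -1)) / C ** 0.5   # single head: d_head = C
        z_t = (attn.softmax(dim=-1) @ v).permute(0, 2, 1, 3)   # (B, T, S, C)
        # spatial-reduction attention within each frame
        z = self.spatial_attn(z_t.reshape(B * T, S, C), H, W)
        return z.reshape(B, T, S, C)
```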

The resulting vector \(z_{s,t}^{l}\) is passed to the FFN to obtain the final encoding. A fully connected (FC) layer is applied to increase the feature channels of the input. Next, GELU is used as the activation function after the first FC layer. To exploit the local context, we add a depth-wise convolutional block to the FFN following the recent works [30, 36, 37]. Finally, the second FC decreases the channels to match the dimension of the input channels:

$$\begin{aligned} z_{s,t}^{l} = \textrm{FFN} ( \textrm{LN} (z_{s,t}^{l}) ) + z_{s,t}^{l} \end{aligned}$$
(10)

3.5 Decoder

A convolutional decoder reconstructs the future frame from the multi-scale feature maps. After the spatial and temporal information has been aggregated, the multi-scale feature maps generated by the four stages of the HST transformer are passed to the decoder via skip connections to restore the spatial resolution of the feature maps. The feature map of the previous stage is upsampled by a deconvolutional layer, and the resized feature map is concatenated with the feature map from the corresponding HST transformer stage, as shown in Fig. 1. The combined feature maps are passed through h convolutional layers, each consisting of a 3 \(\times \) 3 convolution, batch normalization, and a ReLU activation function.
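
A single decoder stage can be sketched as below; the choice of a 2× transposed convolution for upsampling and h = 2 conv-BN-ReLU layers are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Upsample the previous feature map, concatenate the skip feature from
    the corresponding HST stage, then apply h conv-BN-ReLU layers."""
    def __init__(self, in_ch, skip_ch, out_ch, h=2):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        layers, ch = [], out_ch + skip_ch
        for _ in range(h):
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            ch = out_ch
        self.conv = nn.Sequential(*layers)

    def forward(self, x, skip):
        x = self.up(x)                          # double the spatial resolution
        x = torch.cat([x, skip], dim=1)         # fuse with HST feature map
        return self.conv(x)
```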

3.6 Objective function

Given a sequence of input frames \(x_{t-k:t}\), the network aims to predict the future frame \(\hat{x}_{t+1}\) corresponding to the ground-truth frame \(x_{t+1}\). To constrain the predicted frame \(\hat{x}_{t+1}\) to be similar to its ground truth \(x_{t+1}\), three different loss functions are used.

In order to guarantee the similarity of all pixels in RGB space, the \(l_{2}\) loss is applied between the predicted frame \(\hat{x}_{t+1}\) and the actual future frame \(x_{t+1}\) as follows:

$$\begin{aligned} L_{int}(x,\hat{x})=\Vert {x-\hat{x}}\Vert _{2}^{2} \end{aligned}$$
(11)

However, the \(l_{2}\) loss tends to produce a blurred output, so a gradient loss is added to obtain a sharper predicted frame. It computes the difference between the absolute gradients of the predicted frame and its ground truth along the two spatial dimensions:

$$\begin{aligned} L_{gra}(x,\hat{x})=\sum _{i,j}&\Big \Vert {\left| {\hat{x}_{i,j}-\hat{x}_{i-1,j}}\right| -\left| {x_{i,j}-x_{i-1,j}}\right| }\Big \Vert _{1} \nonumber \\ +&\Big \Vert {\left| {\hat{x}_{i,j}-\hat{x}_{i,j-1}}\right| -\left| {x_{i,j}-x_{i,j-1}}\right| }\Big \Vert _{1} \end{aligned}$$
(12)

Following the works [38, 39], Multi-Scale Structural Similarity (MS-SSIM) [40] is used to measure the structural difference. MS-SSIM was proposed to estimate the structural similarity of images at different resolutions.

The final loss is the combination of the intensity loss \(L_{int}\), gradient loss \(L_{gra}\) and multi-scale structural similarity loss \(L_{mss}\) as follows:

$$\begin{aligned} L_{pre}(x,\hat{x})=\alpha L_{int}(x,\hat{x}) + \beta L_{gra}(x,\hat{x}) + \gamma L_{mss}(x,\hat{x}), \end{aligned}$$
(13)

where \(\alpha \), \(\beta \), and \(\gamma \) are three coefficients that balance the weights of the loss functions.
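
A minimal sketch of the combined objective of Eqs. (11)-(13) is shown below; the losses are averaged over pixels here, and `ms_ssim_fn` is a placeholder for any MS-SSIM similarity implementation, both of which are assumptions.

```python
import torch

def intensity_loss(x, x_hat):
    """Intensity (l2) loss of Eq. (11), averaged over pixels here."""
    return torch.mean((x - x_hat) ** 2)

def gradient_loss(x, x_hat):
    """Gradient loss of Eq. (12): difference of absolute spatial gradients."""
    gx_h = (x[..., 1:, :] - x[..., :-1, :]).abs()
    gp_h = (x_hat[..., 1:, :] - x_hat[..., :-1, :]).abs()
    gx_w = (x[..., :, 1:] - x[..., :, :-1]).abs()
    gp_w = (x_hat[..., :, 1:] - x_hat[..., :, :-1]).abs()
    return torch.mean((gp_h - gx_h).abs()) + torch.mean((gp_w - gx_w).abs())

def prediction_loss(x, x_hat, ms_ssim_fn, alpha=1.0, beta=1.0, gamma=1.0):
    """Combined objective of Eq. (13); ms_ssim_fn returns a similarity in [0, 1]
    and is converted to a loss as 1 - similarity (an assumption)."""
    l_mss = 1.0 - ms_ssim_fn(x_hat, x)
    return (alpha * intensity_loss(x, x_hat)
            + beta * gradient_loss(x, x_hat)
            + gamma * l_mss)
```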

3.7 Anomaly detection

Given that the network has been trained using only normal event data in an unsupervised manner, it predicts normal events very well. However, it stumbles when an unseen (i.e., abnormal) event is given as input. We exploit this phenomenon by measuring the difference between the predicted frame and its ground truth.

Peak Signal to Noise Ratio (PSNR) is widely used to estimate image quality; a higher PSNR indicates a higher-quality frame. In other words, when a predicted frame has a high PSNR, its difference from the ground-truth frame is small, and we assume it corresponds to a normal event. PSNR is computed as follows:

$$\begin{aligned} \textrm{PSNR}(x, \hat{x})=10\log _{10} \frac{[\max _{\hat{x}}]^{2}}{\frac{1}{N}\sum _{i=1}^{N}(x_{i}-\hat{x}_{i})^{2}} \end{aligned}$$
(14)

where N denotes the number of pixels in the frame, \([\max _{\hat{x}}]\) is the maximum value of \(\hat{x}\).

Following the work [3], the PSNR values of all frames in a testing video are normalized to the range [0, 1], then the anomaly score S(t) of the t-th frame is obtained by using the following formula:

$$\begin{aligned} S(t)=\frac{\textrm{PSNR}_{t}-\min (\textrm{PSNR})}{\max (\textrm{PSNR})-\min (\textrm{PSNR})} \end{aligned}$$
(15)

where \(\min (\textrm{PSNR})\) and \(\max (\textrm{PSNR})\) denote the minimum and the maximum PSNR values in the given video sequence, respectively. The anomaly scores S(t) are used to determine whether a frame is normal or not using a threshold value.
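
Equations (14) and (15) can be sketched as follows; the per-video normalization assumes that all PSNR values of one test video are collected first.

```python
import torch

def psnr(x, x_hat):
    """PSNR of Eq. (14); the peak value is taken as the maximum of x_hat."""
    mse = torch.mean((x - x_hat) ** 2)
    return 10.0 * torch.log10(x_hat.max() ** 2 / mse)

def anomaly_scores(psnr_values):
    """Min-max normalization of Eq. (15) over all frames of one test video."""
    p = torch.as_tensor(psnr_values, dtype=torch.float32)
    return (p - p.min()) / (p.max() - p.min())
```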

4 Experiments

4.1 Video anomaly detection datasets

The proposed network was evaluated extensively using four benchmark datasets that are divided into two categories: ground-based videos (UCSD Pedestrian dataset [41], CUHK Avenue dataset [42] and ShanghaiTech Campus dataset [43]) and aerial videos (Drone-Anomaly dataset [10]).

Table 2 The numbers of training, testing, and total frames in the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, together with the numbers of normal and abnormal testing frames
Table 3 The numbers of training, testing, and total frames, and the numbers of normal and abnormal frames, for each scene of the Drone-Anomaly dataset

4.1.1 Ground-based Videos

UCSD Pedestrian dataset

The UCSD Pedestrian dataset contains two subsets, Ped1 and Ped2, which were captured in two different outdoor areas. Ped1 has a resolution of 158 \(\times \) 238 pixels, while Ped2 has a higher resolution of 240 \(\times \) 360 pixels. Following recent works [19, 22, 27], Ped1 is excluded from our experiments because of its low resolution. Ped2 contains 16 videos for training and 12 videos for testing, corresponding to 2550 training frames and 2010 testing frames, respectively. Ped2 contains 12 abnormal events, which include bikers, skaters, small carts, people in wheelchairs, and people walking across a walkway or the grass.

CUHK Avenue dataset

The CUHK Avenue dataset consists of 16 videos for training and 21 videos for testing. It contains a total of 30,652 frames which are split into 15,328 training frames and 15,423 testing frames. The resolution of each frame is 360 \(\times \) 640 pixels. The dataset contains 47 abnormal events such as throwing objects, loitering, and running.

ShanghaiTech Campus dataset

The ShanghaiTech Campus dataset has been one of the most challenging datasets for video anomaly detection. The dataset was recorded in 13 different scenes with different light conditions and camera angles. It contains 437 videos and is split into 330 videos for training and 107 videos for testing. The training set contains 274,515 frames, which include normal events while the testing set contains 42,883 frames with 130 abnormal events. Each video has a resolution of 480 \(\times \) 856 pixels.

The numbers of training, testing, and total frames for the Ped2, Avenue, and ShanghaiTech datasets are listed in the 2nd, 3rd, and 4th columns of Table 2, respectively. In addition, the numbers of normal and abnormal frames used in the testing phase are listed in the 5th and 6th columns, respectively.

4.1.2 Aerial videos

Drone-Anomaly dataset

The Drone-Anomaly dataset has been released recently [10], consisting of seven different scenes collected by drones flying over the highway, crossroad, bike roundabout, vehicle roundabout, railway, solar panels, and farmland, respectively. The whole dataset includes 37 training videos and 22 testing videos, corresponding to 51,635 training frames and 35,853 testing frames, respectively. Each frame has a resolution of 640 \(\times \) 640 pixels. This is a challenging dataset since the videos are collected in different places and contain various kinds of anomalous events even in the same scene. Moreover, since these aerial videos are collected while the drone is flying, they have both moving backgrounds and objects, making it hard to detect anomalies.

Since Drone-Anomaly consists of seven categorized scenes and each scene has a different number of frames, the counts of training, testing, and total frames for each scene are given in the 2nd, 3rd, and 4th columns of Table 3, respectively. The numbers of normal and abnormal frames for each scene are listed in the 5th and 6th columns, respectively.

4.2 Implementation details

All video frames of the three ground-based datasets as well as the aerial dataset are resized to 256 \(\times \) 256 and normalized to the range [-1, 1]. A sequence of four frames is fed to the model to predict the fifth frame. The AdamW optimizer is used to train the model with cosine learning rate decay. The initial learning rate is set to \(5e-4\) for the UCSD Ped2, CUHK Avenue, and Drone-Anomaly datasets and \(4e-4\) for the ShanghaiTech dataset. PVTv2-B1 [30] is used as the feature extractor for the UCSD Ped2 and Drone-Anomaly datasets, while the CUHK Avenue and ShanghaiTech datasets use PVTv2-B2 [30] to extract feature maps. Our HSTforU model with PVTv2-B1 and PVTv2-B2 has around 137 million and 149 million parameters, respectively.
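
A minimal sketch of the optimizer and cosine-decay schedule described above is given below; the number of epochs, the weight decay, and the placeholder model are assumptions.

```python
import torch

model = torch.nn.Linear(8, 8)     # placeholder standing in for the HSTforU model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):          # number of epochs is an assumption
    # ... iterate over training clips, compute L_pre, backprop, optimizer.step() ...
    scheduler.step()              # cosine learning-rate decay per epoch
```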

Fig. 5 The ROC curves of the proposed network with and without the HST transformer on three benchmark video datasets: UCSD Ped2, CUHK Avenue, and ShanghaiTech. Note that when PVT is combined with the HST transformer, the network performs best, whereas Swin without the HST transformer performs poorly

4.3 Evaluation metric

Following previous studies [3, 27], the frame-level Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is used as the evaluation metric. A higher AUC means that the given model performs better. An ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), which are defined as follows:

$$\begin{aligned} \textrm{TPR}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}, \end{aligned}$$
(16)
$$\begin{aligned} \textrm{FPR}=\frac{\textrm{FP}}{\textrm{FP}+\textrm{TN}}, \end{aligned}$$
(17)

where \(\textrm{TP}\), \(\textrm{FN}\), \(\textrm{FP}\), and \(\textrm{TN}\) denote true positive, false negative, false positive, and true negative, respectively.
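
Given frame-level ground-truth labels and anomaly scores, the frame-level AUC can be computed as in the sketch below; the label convention (1 = abnormal) and the toy arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

labels = np.array([0, 0, 1, 1, 0, 1])              # hypothetical frame labels (1 = abnormal)
scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9])  # hypothetical per-frame anomaly scores
fpr, tpr, _ = roc_curve(labels, scores)            # TPR/FPR of Eqs. (16)-(17)
auc = roc_auc_score(labels, scores)                # frame-level AUC
print(f"frame-level AUC = {auc:.3f}")
```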

The resulting ROC curves are shown in Figs. 5 and 6 and the AUC results are compared in Tables 9 and 10.

Fig. 6 The ROC curves of our model for the seven aerial scenes in the Drone-Anomaly dataset. Note that performance varies widely across the scenes, from 0.535 to 0.825, indicating that each scene has unique spatio-temporal characteristics

4.4 Ablation study

4.4.1 Ground-based Video

In this section, an ablation study is conducted to investigate: 1) which vision transformer (PVT vs Swin) performs better as the encoder in our HSTforU; 2) the optimal number of HST transformer layers; and 3) the impact of the proposed HST transformer. For this ablation study, the three standard benchmark datasets are used to evaluate the performance.

Vision Transformer as an encoder (PVT vs Swin)

The performance of our network is evaluated with two representative vision transformers: PVT v2 [30] and Swin Transformer v2 [32]. PVT is a well-known vision transformer with a hierarchical structure. Given that the encoder of the standard U-net has a similar structure, we reasoned that the convolution-based encoder could be converted smoothly into a vision-transformer-based one. Similarly, Swin is a vision transformer that produces multi-scale feature maps. The comparison is summarized in Table 4, suggesting that PVT v2 performs better than Swin on the three benchmark datasets under the present setting.

Optimal number of Hierarchical Spatio-Temporal (HST) Transformer layers

In this ablation study, we vary the number of layers of the HST transformer from 1 to 3 and measure the effect on the performance for the three datasets. All stages of the HST transformer have the same number of layers. As shown in Table 5, our HSTforU performs better when the number of layers is increased from 1 to 2, whereas the performance decreases when the number of layers is increased to 3, indicating that the network performs best with two layers.

Impact of HST Transformer

To evaluate the importance of the HST transformer, this section compares the performance of our network with and without it. The result, reported in Table 6, clearly indicates that our HSTforU performs better on all three benchmark datasets when the HST transformer is employed. Figure 5 compares the ROC curves of the proposed network using two encoders (PVT v2 and Swin v2) without the HST transformer, as well as PVT v2 combined with the HST transformer, on the three benchmark datasets. The blue and red lines denote the ROC curves that use PVT v2 as the encoder with and without the HST transformer, respectively, while the black dotted lines indicate the ROC curves for Swin v2 as the encoder without the HST transformer. The results suggest that PVT v2 + HST transformer performs best.

Table 4 Performance comparison of the proposed network for two different feature extractions (PVT and Swin) in terms of AUC (%)
Table 5 Performance comparison of the proposed network for different numbers of HST Transformer layers in terms of AUC (%)
Table 6 Performance comparison of the proposed network with and without HST Transformer in terms of AUC (%)
Table 7 Performance comparison of the proposed network for different numbers of HST Transformer layers on the Drone-Anomaly dataset in terms of AUC (%)

4.4.2 Aerial video

In this section, similar effects are evaluated for the proposed model on the Drone-Anomaly dataset.

Optimal number of HST Transformer layers

As described in Section 3.4, the HST transformer includes four stages, each consisting of \(L_{i}\) layers. To validate the effect of varying the number of HST transformer layers, we report the performance of HSTforU with the number of layers set to \(L_{i}=1\), 2, and 3.

As shown in Table 7, the proposed network with one HST transformer layer achieves the best performance in most scenes. When the HST transformer is set to two layers, the performance increases slightly on the Highway and Bike roundabout scenes. However, when the number of layers is increased to three, no further improvement is seen except on the Farmland scene.

Impact of HST Transformer

Table 8 shows the performance of the proposed network with and without the HST transformer. When the HST transformer is used, the performance is much higher than without it, suggesting that our HST transformer plays an essential role in detecting anomalies on the Drone-Anomaly dataset.

Table 8 Performance comparison of the proposed network with and without HST Transformer on Drone-Anomaly dataset in terms of AUC (%)

4.5 Comparison with state-of-the-art methods

4.5.1 Ground-based videos

In this section, we compare the proposed method with state-of-the-art methods on the three benchmark video datasets, as shown in Table 9. The experimental results indicate that our model achieves the best AUC on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets compared with other recent methods. The AUC on the Ped2 dataset exceeds that of the second-best method [54] by only 0.1%, but our method improves more substantially on the Avenue and ShanghaiTech datasets, reaching 87.8% and 75.3%, gains of 1.0% and 1.3%, respectively. Cai et al. [16] obtained 87.4% on the Avenue dataset, compared to our 87.8% on the same dataset; however, our method obtained 75.3% on the ShanghaiTech dataset, whereas theirs reached only 74.2%. Finally, [20] obtained 74.5% on the ShanghaiTech dataset, but its AUC on the Ped2 and Avenue datasets was 96.0% and 86.3%, compared with our 97.3% and 87.8%. These results suggest that our model outperforms convolution-based U-nets as well as TransAnomaly on the ground-based video datasets.

Table 9 Performance comparison of the proposed network with state-of-the-art methods in terms of AUC (%) on ground-based video datasets
Table 10 Performance comparison of the proposed network with baseline and Skip-Ganomaly methods in terms of AUC (%) on the Drone-Anomaly dataset
Fig. 7 Visualization of the prediction error with and without the HST transformer on the Ped2, Avenue, and ShanghaiTech datasets, from top to bottom. Since the prediction errors with the HST transformer are brighter than those without it, the difference between the two can be seen clearly around the abnormal objects, which are circled in red for convenience. Note that w/o HST denotes “without HST” and w/ HST denotes “with HST”. Each difference map is obtained by subtracting one from the other and is scaled from dark blue to yellow

4.5.2 Aerial videos

Our model is compared with ANDT [10], a recent model for aerial videos, on the seven scenes of the Drone-Anomaly dataset, as shown in Table 10. Our model performs better than ANDT in five of the seven scenes. For instance, our model outperforms ANDT by 1.9%, 0.5%, 4.3%, 8.4%, and 3.5% in AUC on the Highway, Bike roundabout, Vehicle roundabout, Railway inspection, and Solar panel inspection scenes, respectively.

Figure 6 shows the ROC curves for the seven scenes of the Drone-Anomaly dataset. Our model achieves the best performance on the Bike roundabout (green line), whereas the lowest result is on the Crossroads scene (red line). Given that performance varies widely across the scenes, from 0.535 to 0.825, each scene of the Drone-Anomaly dataset appears to have unique spatio-temporal characteristics; this suggests that our model extracts spatio-temporal features well overall, yet may need fine-tuning for particular aerial scenes.

Fig. 8 A normal frame (left), an abnormal frame (middle), and the prediction error of the abnormal frame (right) are shown in each panel, respectively, for three benchmark video anomaly detection datasets. The prediction error is scaled from blue to yellow, up to red

4.6 Qualitative comparison with visualization of prediction error

Although we have seen quantitatively that the HST transformer enhances the overall performance of the system, it is also helpful to visualize this enhancement. Figure 7 collects three abnormal events from three different datasets, namely Ped2, Avenue, and ShanghaiTech, from top to bottom. In the top row, the prediction errors are visualized with and without the HST, and the difference map contains two abnormal regions, a vehicle and a cyclist, circled in red. In the middle row, a man throwing a bag exhibits abnormal behavior, and the prediction error around him is distinctive; the difference between with and without the HST can be seen clearly in the difference map as well. In the bottom row, a riding motorcyclist is identified as abnormal in the prediction error maps, and the difference between with and without the HST is illustrated in the difference map.

Fig. 9 The normal frames (left), the abnormal frames (middle), and the prediction errors of the abnormal frames (right) are shown, respectively, for the seven scenes in the Drone-Anomaly dataset

Fig. 10 Anomaly scores for three representative ground-based video datasets, i.e. UCSD Ped2, CUHK Avenue, and ShanghaiTech. In each panel, two anomaly scores are drawn: the blue line denotes the score with the HST transformer, whereas the dotted line denotes the score without it. In each panel, abnormal regions are shaded pink and normal regions are white

Fig. 11 Anomaly scores for three aerial scenes in the Drone-Anomaly dataset, i.e. highway, bike roundabout, and solar panel. In each panel, two anomaly scores are drawn: the blue line denotes the score with the HST transformer, whereas the green line denotes the score without it. In each panel, abnormal regions are shaded pink and normal regions are white

Fig. 12 Two examples where our network fails to detect an anomaly. (a) A man occupies just a few pixels within the image. (b) A skateboarder maintains a pose for a few seconds

4.7 Visualization

4.7.1 Prediction error

Figure 8 visualizes how our model responds to the ground-based video datasets. In each panel, the left image shows the normal case, the middle depicts the anomaly, and the right represents the prediction error, scaled from blue through yellow up to red. In the top row, the two abnormal objects are a car and a cyclist against a background of pedestrians. In the middle row, a man is throwing a bag. In the bottom row, a motorcyclist is riding among a few pedestrians.

For the aerial video dataset, Fig. 9 visualizes how our model detects anomalies in the seven different scenes. In each panel, the left image shows the normal case, the middle depicts the anomaly, and the right represents the prediction error. Because this is a challenging dataset, as mentioned before, it is helpful to describe the context of each scene in terms of its abnormality: in the highway scene, a group of cattle intrudes on the highway; at the crossroad, a man crosses the road illegally; a big bus intrudes on the bike roundabout; a man crosses the road illegally at the vehicle roundabout; an unidentified object is found on the railway; one of the solar panels has collapsed and now appears grey; and a bicycle is found in the middle of the farmland. See also our demo video at https://vt-le.github.io/HSTforU/.

4.7.2 Anomaly score

Figure 10 illustrates the anomaly scores of three example videos from the ground-based datasets. In each panel, the blue line denotes the score of our model with the HST transformer, whereas the green line indicates the score without it. For the normal regions, the score of our model with the HST transformer is often lower than that without it. For the abnormal regions (shaded pink), most of the blue curve lies above the green one, e.g. in (c), indicating that our network with the HST transformer discriminates normal and abnormal events better than without it.

The anomaly scores of three videos corresponding to three scenes of the Drone-Anomaly dataset are shown in Fig. 11. Similar observations can be made in these panels, particularly in Fig. 11(a): during an abnormal event, the anomaly score of our model with the HST transformer is often higher than that without it.

4.8 Limitations

Although our network achieves state-of-the-art performance, there are several challenging cases in which it fails to detect abnormal events. First, when the distance between the camera and the object is very large, as shown in Fig. 12(a), where a skateboarder appears in the top-right corner of the frame and occupies only a few pixels, our network produces a very small prediction error for this object. Second, Fig. 12(b) shows another case where a skateboarder stands at a considerable distance from the camera and keeps the same pose across a sequence of consecutive frames. As a result, the network produces a small prediction error around him.

The primary cause of these limitations appears to be the angle and distance between the camera and the objects. An immediate remedy is to increase the resolution of the video frames, but ultimately it will be necessary to develop a model that can capture different viewpoints of the same scene or understand the more sophisticated context within the scene.

5 Conclusion

Although the primary target of VAD was restricted to ground-based datasets until recently, aerial datasets have grown rapidly as drones have become widely available and have found many interesting applications. In this study, we propose a transformer-based VAD method with which one can investigate both aerial and ground-based anomaly datasets in a unified framework. The U-net-based prediction network is trained on normal events under an unsupervised learning paradigm and is asked to discriminate abnormal events from normal ones during testing. The encoder of our U-net is a multi-stage transformer that generates multi-scale feature maps by extracting visual features from the input video in a coarse-to-fine manner. The merit of this architecture is its flexibility, since any hierarchical transformer, such as a pyramid vision transformer or Swin, can be plugged in depending on the designated application. To handle the aerial datasets, which contain diverse dynamic backgrounds and moving objects, we propose a hierarchical spatio-temporal transformer that extracts both temporal and spatial features effectively, since it is designed to compute spatial and temporal features jointly using a new attention mechanism. Evaluation results on the three benchmark ground-based anomaly detection datasets were 97.3% on Ped2, 87.8% on Avenue, and 74.9% on ShanghaiTech, and those on the Drone-Anomaly dataset varied from 53.5% to 82.5% across scenes, suggesting that our model outperforms state-of-the-art methods and can detect anomalies in both aerial and ground-based videos with a single high-performing model.