1 Introduction

Group re-identification (Re-ID) is a significant but relatively less studied task in video surveillance systems [1, 2]. Given a group sequence in one camera, the task of group Re-ID is to find matching sequences of the same group, with the same people, under different cameras. At present, group Re-ID methods are mainly image-based; the video-based group Re-ID task is rarely studied and remains promising yet unsolved.

Group Re-ID has many connections with person Re-ID, so it is not a completely independent task. A group contains multiple pedestrians whose movement direction and speed are largely consistent, which means that group Re-ID addresses re-identification from a holistic perspective. Group Re-ID also includes challenges such as illumination changes, variations across viewpoints, scale changes and occlusion that likewise exist in person Re-ID. Recent works on person Re-ID widely utilize convolutional neural networks (CNNs) to extract the spatial information of pedestrians [3,4,5,6] and adopt recurrent neural networks (RNNs or LSTMs) to extract reliable temporal features for person Re-ID [7, 8].

Beyond these commonalities, there are many differences between group Re-ID and person Re-ID. Group Re-ID poses more challenges than person Re-ID, including self-occlusion within groups, misalignment between group videos, changes in the relative positions of pedestrians within a group, and interference from other pedestrians, all of which keep group Re-ID a challenging problem.

Since there is little research on the video-based group Re-ID task, traditional video-based person Re-ID methods are potentially helpful for it. The works of [9, 10] utilize deep recurrent networks (RNNs) to extract temporal information from person sequences, achieving considerable performance improvements. The work of [5] utilizes an attention model to emphasize the more important areas and frames, making the features learned by the RNN more effective. However, these architectures use the CNN or the RNN layer in isolation, so the output features lack comprehensiveness. Since an RNN cannot fully integrate the long-term information of all sequence frames, spatial and temporal features should be fused effectively to improve performance. The work of [11] proposes a residual learning framework in which deep layers better fit a desired optimal mapping indirectly by learning the residual; such residual learning can help adaptively fuse spatial and temporal features for the group Re-ID task.

In this paper, we propose a spatial-temporal fusion network with a residual learning mechanism that is composed of an element-wise addition between the CNN and the recurrent layer in a unified network. The spatial features learned by the CNN are sent to a recurrent neural network (RNN [12] or LSTM) for temporal feature learning. Element-wise addition is then performed between the extracted temporal and spatial features to obtain more discriminative fused features, which improve video-based group Re-ID.

Fig. 1. Samples from the DukeGroupVid dataset. The dataset contains different kinds of challenges, such as illumination changes, occlusion and interference from other pedestrians.

Furthermore, we incorporate a spatial-temporal attention mechanism into the model to extract more discriminative group features. For the spatial features, we use both position attention and channel attention for feature representation. For the temporal features, we utilize temporal attention to integrate sequence information. After the spatial-temporal attention is integrated into the model, the results improve considerably.

To demonstrate the effectiveness of our proposed method, we deploy our spatial-temporal fusion network on multiple datasets. We create a new dataset called DukeGroupVid based on the DUKEMTMC [13] dataset. To the best of our knowledge, the proposed DukeGroupVid is the first video-based group Re-ID dataset that can be used for deep learning training, containing varied challenges such as misalignment and self-occlusion within the group. DukeGroupVid consists of 371 different groups and 890 tracklets captured by 8 cameras. Tracklets contain 12 to 6444 frames, while most IDs are captured by 2 to 4 cameras. This dataset is challenging and necessary for evaluating the performance of video-based group Re-ID methods. Samples of the DukeGroupVid dataset with multiple challenges are shown in Fig. 1.

In conclusion, the main contributions of this paper are threefold:

  • We propose a spatial-temporal fusion network for video-based group Re-ID. Instead of utilizing only a CNN or a recurrent layer to extract features, we introduce residual learning between the RNN and the CNN in a unified network to improve performance.

  • We introduce multiple attention mechanisms into the residual learning model so that the network places greater emphasis on important parts and frames, making the learned features more effective.

  • A new dataset called DukeGroupVid, based on the DUKEMTMC [13] dataset, is created. Comprehensive experiments are conducted on the proposed dataset and other widely used datasets, demonstrating the effectiveness of our model.

2 Proposed Method

The key point of our spatial-temporal fusion network is to learn and fuse both the spatial and temporal features of a given group. We learn the spatial and temporal information separately and incorporate the CNN and the RNN/LSTM in a unified network. Residual learning is carried out by element-wise addition between the spatial and temporal features to obtain more discriminative fused features that improve video-based group Re-ID. We consider a variety of attention mechanisms and integrate a spatial-temporal attention mechanism into the network, enabling it to extract more discriminative sequence features. In summary, we combine residual learning with the spatial-temporal attention mechanism and propose an improved spatial-temporal fusion network that yields performance gains on the video-based group Re-ID task.

2.1 Model Structure

For the video-based group Re-ID task, the spatial features extracted by a convolutional neural network (CNN) capture only spatial characteristics and cannot integrate time-domain information. At the same time, although a recurrent neural network (RNN) can accept a wide range of sequence inputs, it cannot fully integrate all the sequence information of the frames. The works of [9, 10] adopt temporal pooling after the recurrent layers to average the outputs over the whole person sequence, which does not adequately solve this problem.

In this paper, we propose a spatial-temporal fusion network with a residual learning mechanism that is composed of an element-wise addition between the CNN and the recurrent layer in a unified network. At the same time, we introduce multiple attention mechanisms into the residual learning model so that the network places greater emphasis on important parts and frames, making the learned features more effective. Our spatial-temporal fusion network is shown in Fig. 2.

Fig. 2. An illustration of our proposed spatial-temporal fusion network based on residual learning and spatial-temporal attention. The network consists of four main parts: the Base CNN Model, the spatial attention mechanism, the temporal attention mechanism and the residual learning of spatial-temporal features.

The spatial-temporal fusion network consists of four main parts. As shown in Fig. 2, the first part is the Base CNN Model. The network architecture with the residual learning mechanism and spatial-temporal attention is built on a standard CNN model such as ResNet [11] or DenseNet [14]. The second part is the spatial attention mechanism. Considering challenges such as self-occlusion and interference from pedestrians outside the group, the features extracted from a group sequence should concentrate on the important parts and pedestrians in each frame, and the model should be robust, reducing interference caused by backgrounds or other pedestrians. The third part is the temporal attention mechanism. Video sequences containing multiple frames provide more information, but the discriminative content must be selected effectively, since sequences often suffer from image misalignment and illumination changes. The temporal attention mechanism helps find the more valuable frames and assigns them higher weights, enabling efficient use of this discriminative information. The last part is residual learning. Since the RNN model cannot fully integrate all the sequence information of the frames, and pooling alone brings only limited improvement, we propose a residual learning method and a new model that extracts spatial-temporal fusion features, addressing the limitations of models such as plain RNNs and temporal pooling. We deploy our adaptive spatial-temporal feature extraction mechanism to make the temporal features extracted by the RNN consistent with the spatial features extracted by the original CNN, and apply attention mechanisms to make the network more discriminative.
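To make the composition of these four parts concrete, the following is a minimal structural sketch in PyTorch-style code. It is an illustration under our own assumptions rather than the exact implementation: the class and argument names are hypothetical, the spatial attention module is a placeholder detailed in Sect. 2.3, plain temporal pooling stands in for the temporal attention of Sect. 2.3, and base_cnn is assumed to be any frame-level backbone (e.g. a ResNet-50 with its classifier removed) that outputs one feature vector per frame.

import torch.nn as nn

class SpatialTemporalFusionNet(nn.Module):
    def __init__(self, base_cnn, feat_dim=2048):
        super().__init__()
        self.base_cnn = base_cnn              # part 1: Base CNN Model
        self.spatial_attn = nn.Identity()     # part 2: spatial attention (placeholder, Sect. 2.3)
        self.rnn = nn.RNN(feat_dim, feat_dim, batch_first=True)  # temporal branch (Sect. 2.2)

    def forward(self, clips):                 # clips: (B, T, C, H, W) group image sequences
        b, t = clips.shape[:2]
        f = self.base_cnn(clips.flatten(0, 1))        # frame-level spatial features, (B*T, D)
        f = self.spatial_attn(f).view(b, t, -1)       # (B, T, D)
        o, _ = self.rnn(f)                            # temporal features from the recurrent layer
        x_c = f.mean(dim=1)                           # temporal pooling of spatial features
        x_r = o.mean(dim=1)                           # temporal pooling of RNN outputs
        # part 3 (temporal attention, Sect. 2.3) replaces these plain means with weighted pooling
        return x_c + x_r                              # part 4: residual spatial-temporal fusion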

Fig. 3. An illustration of residual learning between the RNN and the CNN. The residual learning mechanism is composed of an element-wise addition between the CNN and the recurrent layer in a unified network.

2.2 Residual Learning Mechanism

The performance of temporal features extracted by the recurrent layer alone is inferior, sometimes even lower than that of image-based models [15]. To solve this problem, we propose a spatial-temporal fusion network that combines a residual learning mechanism with spatial-temporal features to enhance video-based Re-ID. The model structure is illustrated in Fig. 3.

Residual learning [11] was proposed to address the degradation problem: optimizing the residual mapping is easier than optimizing the original, unreferenced mapping. Therefore, the residual represented by the recurrent layer can help further optimize the two-dimensional CNN and make the learned features more discriminative.

Specifically, given an input sequence \(s = (s^{(1)}, s^{(2)}, \ldots, s^{(T)})\), where T is the sequence length and \(s^{(t)}\) denotes the person image at time t, the feature vector obtained from the CNN is written as

$$\begin{aligned} f^{(t)} = C(s^{(t)}) \end{aligned}$$
(1)

where \(C(\cdot)\) abbreviates the feature extraction performed by the CNN layers (such as ResNet [11] or DenseNet [14]). An RNN layer then receives the output feature vectors from the CNN layers to accumulate features from earlier images within the video sequence. The input of the recurrent layer is the feature vector \(f^{(t)}\) obtained from the CNN. The recurrent layer learns long-term dependencies and retains information over long periods within a person sequence, as denoted by the following formulas:

$$\begin{aligned} o^{(t)} = W_{i}f^{(t)}+W_{s}r^{(t-1)} \end{aligned}$$
(2)
$$\begin{aligned} r^{(t)} = \tanh (o^{(t)}) \end{aligned}$$
(3)

where \(r^{(t)}\) is the hidden state passed to the next time step and \(o^{(t)}\) produces an output based on both the current input and information from previous time steps. Both \(o^{(t)}\) and \(f^{(t)}\) pass through a linear combination to produce feature vectors of dimension N. Then \(o^{(t)}\) is fed into a temporal pooling layer to produce a single feature vector that accumulates the appearance information of the sequence and captures the periodic characteristics of a given person image sequence:

$$\begin{aligned} x_{R} =\frac{1}{T}\sum _{t=1}^{T}o^{(t)} \end{aligned}$$
(4)

where \(x_{R}\) is the desired temporal feature. In our case, to make the element-wise addition in the residual learning adaptive and consistent, the feature vectors \(f^{(t)}\) from the CNN are also fed into a temporal pooling layer to produce a single feature vector representing the information averaged over the whole input sequence of a given person, i.e. the averaged spatial feature, denoted by the following formula:

$$\begin{aligned} x_{C} =\frac{1}{T}\sum _{t=1}^{T}f^{(t)} \end{aligned}$$
(5)

where \(x_{C}\) denotes the spatial feature extracted by the CNN layers in the improved network. Furthermore, unlike the prior work of [12], which uses a one-stream network to process image sequences from different camera views and sends the spatial features from the CNN to the RNN for temporal feature extraction, we improve the network with two separate streams, a CNN and an RNN layer, that respectively extract the spatial and temporal features for more adaptive spatial-temporal fusion, bridging the gap between different camera views. The fusion operation is formulated as:

$$\begin{aligned} x_{F} =x_{C} +x_{R} \end{aligned}$$
(6)

where the “+” denotes the element-wise addition in the cascade residual learning.
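For clarity, the residual learning defined by Eqs. (1)-(6) can be transcribed directly at the tensor level as below. This is a hedged sketch of the computation for a single sequence, not the authors' implementation: f is assumed to already hold the CNN features of Eq. (1), the recurrent weights W_i and W_s are taken as plain square matrices, and the initial hidden state is assumed to be zero.

import torch

def residual_fusion(f, W_i, W_s):
    """f: (T, N) CNN features of one sequence (Eq. (1)); W_i, W_s: (N, N) recurrent weights."""
    T, N = f.shape
    r = torch.zeros(N)                         # assumed initial hidden state
    outputs = []
    for t in range(T):
        o = f[t] @ W_i.T + r @ W_s.T           # Eq. (2): o^(t) = W_i f^(t) + W_s r^(t-1)
        r = torch.tanh(o)                      # Eq. (3): r^(t) = tanh(o^(t))
        outputs.append(o)
    x_r = torch.stack(outputs).mean(dim=0)     # Eq. (4): temporal pooling of the RNN outputs
    x_c = f.mean(dim=0)                        # Eq. (5): temporal pooling of the CNN features
    return x_c + x_r                           # Eq. (6): element-wise residual fusion x_F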

Fig. 4. An illustration of the temporal attention and spatial attention mechanisms used in our proposed spatial-temporal fusion network. The attention mechanisms make the feature extraction model focus on more discriminative features.

2.3 Improved Spatial-temporal Attention Mechanism

In this section, we further optimize the network structure with the spatial-temporal attention mechanism, which replaces the temporal pooling in the previous model. Since the pooling operation simply averages the features of all frames in the sequence, it does not place greater focus on the more important frames. Temporal pooling is simple to implement, but it is important to assign different weights to the features of different frames.

The sequence feature based on the temporal attention mechanism is calculated as

$$\begin{aligned} f_{ta} =\frac{1}{T}\sum _{t=1}^{T}attn_{t}^{(t)}f^{(t)} \end{aligned}$$
(7)

where \(attn_{t}^{(t)}\) is the attention weight of frame t of the sequence features. The temporal attention network takes the image-level sequence features [T, 1024, w, h] (taking DenseNet121 [14] as an example) as input and outputs attention scores of length T. The temporal attention network is shown in Fig. 4. We apply a convolution layer with hidden dimension \(d_{t}\) and a fully connected (FC) layer with output dimension 1, which produce the scalar score \(s^{(t)}\) for each frame. The attention weight for frame t of the sequence is calculated as

$$\begin{aligned} attn_{t}^{(t)} = \frac{e^{s^{(t)}}}{\sum _{k=1}^{T}e^{s^{(k)}}} \end{aligned}$$
(8)
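A possible realization of this temporal attention branch, matching Eqs. (7)-(8), is sketched below in PyTorch-style code. The hidden width d_t, the global pooling placed between the convolution and the FC layer, and the per-frame scoring are our assumptions where the text leaves them unspecified.

import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, in_channels=1024, d_t=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, d_t, kernel_size=1)   # hidden dimension d_t
        self.fc = nn.Linear(d_t, 1)                              # output dimension 1

    def forward(self, feats):                       # feats: (T, C, h, w) image-level features
        s = self.conv(feats)                        # (T, d_t, h, w)
        s = F.adaptive_avg_pool2d(s, 1).flatten(1)  # (T, d_t), assumed spatial pooling
        s = self.fc(s).squeeze(-1)                  # per-frame scores s^(t), shape (T,)
        attn = F.softmax(s, dim=0)                  # Eq. (8): normalize scores over the sequence
        frames = feats.flatten(1)                   # (T, C*h*w)
        return (attn.unsqueeze(1) * frames).mean(dim=0)  # Eq. (7): weighted temporal pooling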

Besides temporal attention, we apply a multi-spatial-attention mechanism to extract more discriminative spatial features. For the channel attention part, we first perform a squeeze operation [16] with a global average pooling layer to obtain the shrunken feature \(f_c\). This operation turns each 2D feature map into a real number \(f_c\), calculated as

$$\begin{aligned} f_c = \frac{1}{h \times w}\sum \limits _{i = 1}^{w}\sum \limits _{j= 1}^{h}{\mathbf {X}}_{j,i,c} \end{aligned}$$
(9)

The dimension of the squeeze operation's output equals the number of input feature channels. The second operation is excitation, in which two FC layers form a bottleneck structure that models the correlation between channels and produces the channel attention. For the position attention, cross-channel pooling is first applied before learning the position attention map, which is modeled by a convolution layer with a 1 \(\times \) 1 filter and stride 1. The spatial attention network is shown in Fig. 4.
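The two spatial attention branches can be sketched as follows, again as an illustration rather than the exact implementation: the reduction ratio of the bottleneck, the sigmoid gating, and the way each attention map is multiplied back onto the features are common choices we assume here, since the text does not fix them.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(                       # excitation: two-FC bottleneck
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                              # x: (B, C, h, w)
        f_c = x.mean(dim=(2, 3))                       # Eq. (9): squeeze by global average pooling
        w = self.fc(f_c).unsqueeze(-1).unsqueeze(-1)   # per-channel weights, (B, C, 1, 1)
        return x * w

class PositionAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=1, stride=1)  # 1 x 1 filter, stride 1

    def forward(self, x):                              # x: (B, C, h, w)
        pooled = x.mean(dim=1, keepdim=True)           # cross-channel pooling, (B, 1, h, w)
        attn = torch.sigmoid(self.conv(pooled))        # position attention map
        return x * attn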

3 Experiment and Results

3.1 Proposed Dataset

A new dataset called DukeGroupVid, based on the DUKEMTMC [13] dataset, is created. This dataset, which is the first known video-based group Re-ID dataset that can be used for deep learning training, contains different kinds of challenges such as misalignment and self-occlusion within the group. As an extension of the DUKEMTMC dataset [13], it consists of 371 different groups and 890 tracklets captured by 8 cameras. Tracklets contain 12 to 6444 frames, while most IDs are captured by 2 to 4 cameras. This dataset remains challenging and is necessary for evaluating the performance of the proposed video-based group Re-ID methods.

3.2 Evaluation Metrics

In order to test the performance of our method, following common person Re-ID research practice, we use the cumulative matching characteristic (CMC) to measure the performance of our spatial-temporal fusion network. We report the Rank-1, Rank-5 and Rank-20 results in Table 1 and compare them with other methods. We also adopt mean average precision (mAP) as a second metric: average precision (AP) is computed for each target tracklet based on the precision-recall curve, and mAP is the average of AP across all target tracklets.
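For reference, the two metrics can be computed per query as in the short sketch below; this follows the common definitions of CMC and AP rather than the paper's evaluation code, and ranked_labels and query_label are hypothetical names for the gallery IDs sorted by descending similarity and the query's ID.

import numpy as np

def cmc_and_ap(ranked_labels, query_label, ranks=(1, 5, 20)):
    matches = np.asarray(ranked_labels) == query_label
    # CMC: rank-k is 1 if a correct match appears within the top k results
    cmc = {k: int(matches[:k].any()) for k in ranks}
    # AP: average of the precision values at each position holding a correct match
    hits = np.where(matches)[0]
    ap = np.mean([(i + 1) / (pos + 1) for i, pos in enumerate(hits)]) if len(hits) else 0.0
    return cmc, ap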

3.3 Results

We compare our spatial-temporal fusion network with several baseline networks. To better illustrate the effectiveness of the spatial and temporal features in the networks after our residual learning mechanism, we deploy the model to separately extract different features with the CNN, the RNN and the attention model, and compare their performance. RNNs are known to suffer from vanishing gradients, so they integrate adjacent frames well but cannot fully integrate the periodic information from all sequence frames, especially the earlier ones. In this section, we further compare different kinds of CNNs to evaluate their respective effectiveness within different deep networks.

Table 1. Comparisons of our spatial-temporal fusion network with some baseline networks.

Group Re-ID shares some similarities with person Re-ID but poses additional challenges, such as misalignment of frames and more background information. Therefore, directly applying person Re-ID methods to the group task does not yield satisfactory results [19].

Combining the CNN and the RNN performs better than using either the CNN or the RNN alone. The residual learning framework addresses the degradation problem so that the deep layers indirectly better fit a desired optimal mapping. The results demonstrate the effectiveness of our spatial-temporal fusion network with residual learning and attention: adding the information extracted by the CNN to the RNN output compensates for the information lost in the RNN layer.

4 Discussion

4.1 Analysis of Different Layers

Table 2 shows the performance of different attention models with our proposed residual learning network. The spatial-temporal attention mechanism provides additional benefits for extracting discriminative features. We conduct a comparative experiment on spatial attention and temporal attention, and the results show that both improve performance. These results demonstrate the validity of our spatial-temporal attention mechanism within the spatial-temporal fusion network for the video-based group Re-ID task.

Table 2. Component analysis of different kinds of attention mechanism on residual learning network. The base CNN model is ResNet50 and our spatial-temporal fusion network performs the best.

4.2 Experiments on Video-Based Person Re-ID Datasets

In addition to verifying our spatial-temporal fusion network on the group Re-ID dataset, we validate our method on video-based person Re-ID datasets. Our network not only applies to the group Re-ID task but also performs well on the person Re-ID task. The comparison with state-of-the-art methods demonstrates the effectiveness of our method on video-based person Re-ID.

Video-Based Person Re-ID Datasets. We adopt three typical datasets widely used for video-based person Re-ID to evaluate the performance of our method on the person Re-ID task.

PRID-2011: The PRID 2011 [17] dataset contains two camera views, but only 200 people appear in both views. There are 400 image sequences in the dataset, each with a variable length of 5 to 675 frames.

i-LIDS-VID: The iLIDS-VID dataset [18] is captured by two camera views. Each camera view contains the same 300 people, and each image sequence has a variable length of 23 to 192 frames. Due to clothing similarities among people, lighting and viewpoint variations across camera views, cluttered backgrounds and occlusions, this dataset is much more challenging than PRID-2011.

MARS: The MARS dataset [6] consists of 1261 different identities and 20715 tracklets. A large number of tracklets contain 25 to 50 frames, while most IDs are captured by 2 to 4 cameras and have 5 to 20 tracklets.

Table 3. Comparisons of our Spatial-temporal Fusion Network and the state-of-the-art methods on Video-based Person Re-ID Datasets. The top 1 and 2 results are in bold and italic.

Experiment Results. We adopt DenseNet121 [14] as our CNN-based architecture to extract high-level features and compare the results between multiple structures as shown in Table 3.

Comprehensive experimental results on the three representative datasets, PRID-2011, i-LIDS-VID and MARS, show that the network improved by our approach performs favorably against the existing state-of-the-art methods, demonstrating the effectiveness of the proposed network.

5 Conclusion

In this paper, we propose a spatial-temporal fusion network for adaptive spatial-temporal feature extraction, incorporating a convolutional neural network (CNN) and a recurrent neural network (RNN/LSTM) to jointly exploit spatial and temporal information for improving video-based group Re-ID. We make the feature extraction of the recurrent layers consistent with that of the CNN layers to construct an adaptive element-wise addition in the residual learning mechanism. We also incorporate the spatial-temporal attention mechanism into the model to better extract discriminative features, and we build the proposed network on different state-of-the-art deep networks to obtain improved spatial-temporal fusion networks with further performance gains.

To demonstrate the effectiveness of our proposed method, we propose a new dataset, DukeGroupVid, to validate our network. Further comparisons are also made to show the respective significance of the CNN and the recurrent layer in different kinds of networks. In addition, comprehensive experimental results on the three representative datasets, PRID-2011, i-LIDS-VID and MARS, show that the network improved by our approach performs favorably against the existing state-of-the-art methods. The results demonstrate the effectiveness of the proposed spatial-temporal fusion network.