1 Introduction

Nowadays, video surveillance is widely deployed in many security-sensitive places, such as hospitals, schools, banks, and museums, and the areas covered by different cameras are often non-overlapping. Therefore, person re-identification (re-id) [1,2,3,4,5] across non-overlapping cameras has been proposed and widely studied. However, person re-id faces many challenges, the most serious of which is occlusion.

As we know, in crowded public places some pedestrians are always occluded by objects, e.g. cars, trees, and garbage cans, so occluded person re-id is needed for practical pedestrian applications. One of the hardest problems in occluded person re-id is how to deal with occluded person images, because the pedestrian images produced by most pedestrian detectors include occlusions, while the useful information we need is only the person body parts. These occluded areas interfere with person re-id and introduce redundant information into the matching process, which degrades re-id performance. Among the works related to occluded person re-id [6, 7], Zheng et al. [7] matched partial person images against full-body person images with a framework combining a local-to-local matching model and a global-to-local matching model, and achieved effective results. However, in practice there are no suitable detectors for occlusions, since occlusions have diverse characteristics, such as colors, sizes, shapes, and positions; the partial person images are obtained by manual cropping, which is laborious and time-consuming (see Fig. 1). Instead of detecting the occlusions, we consider detecting the useful information in occluded person images, namely the person body parts.

Fig. 1. Illustration of partial person re-id solving occlusions. (a): occluded person images detected by an existing pedestrian detector; (b): partial person images after manual cropping; (c): full-body person images that partial person re-id searches for.

Among existing works on detecting human body parts, most aim to detect or cluster specific local part regions, such as the head, shoulders, arms, legs, and feet. For occluded images, since both the occlusions and the non-occluded body parts are usually irregular, it is difficult to obtain definite boundaries between the body and the occlusions with bounding-box detection methods, which results in poor cropping. Unlike these works, we use a deep learning based saliency detection framework to detect person body parts and obtain partial person regions. There are two main reasons for using a saliency detection framework. First, saliency detection methods separate the foreground and the background of an image at the pixel level, which yields higher precision: they can better find the boundaries between the regions of person body parts and the regions of background or noise. Second, a deep learning based saliency detection network can learn semantic information, which helps the framework identify the semantics of the regions we need to acquire, namely the person body parts (Fig. 2).

Fig. 2. Illustration of detecting person body parts at the pixel level via saliency detection.

Although saliency detection has been studied extensively, few works have made initial attempts to apply it to occluded person re-id. Due to the poor quality of most person re-id benchmark datasets, higher requirements are placed on the features extracted for saliency detection: the information in the images needs to be preserved as carefully as possible. This paper proposes a double-line multi-scale fusion (DMF) network, which not only extracts richer features through a double-line block and a multi-scale fusion block, but also improves the feature fusion process so that high-level and low-level information complement each other better. Besides, a fully-connected CRF [8] is applied after the DMF network for further refinement. After preprocessing occluded person images with the proposed saliency detection network, we obtain masks of the occluded person images and crop the regions of interest into partial person images, so as to achieve partial person re-id. In summary, this paper makes three main contributions.

  • We make a new attempt to apply saliency detection to the occluded person re-id problem: it detects person body parts at the pixel level and then crops occluded person images into partial person images.

  • To complement high-level and low-level information, we propose a double-line multi-scale fusion (DMF) network consisting of a double-line feature extraction block and a multi-scale feature fusion block. The former diversifies the extracted information, and the latter fuses and complements information from different scales.

  • Experiments on five heterogeneous benchmark datasets show the high performance of our approach compared with other state-of-the-art methods. Besides, our approach is applied to two benchmark occluded person re-id datasets to deal with occluded person images.

2 Proposed Method

We propose a double-line multi-scale fusion (DMF) network for saliency detection, which is based on the VGG-16 [9] network, as shown in Fig. 3. The DMF network includes two main components: the double-line feature extraction block and the multi-scale feature fusion block. To make the extracted features more diverse, the double-line block computes the difference between a feature and its average-pooled version and concatenates it with the convolved feature to obtain richer information. The multi-scale fusion block combines high-level information with low-level information to reduce the loss of details at the lower floors. Finally, the features from the base VGG-16 network and from the double-line multi-scale fusion are concatenated and fed into a softmax loss to obtain predicted masks. We also use a fully-connected CRF [8] as a post-processing step. This saliency detection network can be used to process occluded person images in occluded person re-id.
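To make the backbone concrete, below is a minimal PyTorch sketch of how the outputs \(X_1,\ldots,X_5\) of the five convolution blocks can be taken from a pretrained VGG-16. The slice indices follow torchvision's layer ordering; the wiring is our assumption, not the authors' released code.

```python
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Backbone(nn.Module):
    """Expose the outputs X_1 .. X_5 of VGG-16's five convolution blocks.

    The slice boundaries sit just after each ReLU preceding a
    max-pooling layer in torchvision's vgg16().features.
    """
    def __init__(self):
        super().__init__()
        feats = vgg16(pretrained=True).features
        self.blocks = nn.ModuleList([
            feats[:4],     # conv1_1 .. relu1_2
            feats[4:9],    # pool1, conv2_1 .. relu2_2
            feats[9:16],   # pool2, conv3_1 .. relu3_3
            feats[16:23],  # pool3, conv4_1 .. relu4_3
            feats[23:30],  # pool4, conv5_1 .. relu5_3
        ])

    def forward(self, x):
        outs = []
        for block in self.blocks:
            x = block(x)
            outs.append(x)  # X_1 .. X_5, at decreasing resolution
        return outs
```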

Fig. 3. The architecture of the proposed double-line multi-scale fusion (DMF) network: (a) the double-line feature extraction block; (b) the multi-scale feature fusion block.

2.1 Double-Line Feature Extraction

For saliency detection, it is critically important that the extracted features can distinguish the foreground from the background, so we need richer features for this determination. In this paper, we adopt a structure similar to the feature pyramid [10]. Our network first uses VGG-16 as the basic network to extract features; all of its convolution layers are divided into five convolution blocks, and features are output from each convolution block. This retains information from each floor and increases the diversity of the features. A more important way to increase feature diversity is the double-line feature extraction part. As shown in Fig. 3(a), this part consists of a convolutional layer and an average pooling layer and has two outputs. One output, \(X_i^{conv}\), is the feature \(X_i\) passed through the convolutional layer; the other, \(X_i^{conv}-X_i^{pool}\), resembles a pooling residual structure, i.e., the difference (D-value) between the convolved feature and the feature after average pooling. The double-line outputs are then concatenated as the final output. In this process, the pooling residual structure offers additional abundant information: average pooling smooths images, and the difference between the original and smoothed features embodies the salient information that we wish to grasp.

Let \(X_i\ (i=1,\ldots,5)\) denote the output features of each convolution block of the VGG-16 network, and let \(X_i^{conv}\) and \(X_i^{pool}\) be \(X_i\) passed through the convolutional layer and the average pooling layer, respectively. The output of the double-line feature extraction block can be expressed as:

$$\begin{aligned} X_i^{out}= \{X_i^{conv}, X_i^{conv}-X_i^{pool}\} \end{aligned}$$
(1)
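A minimal PyTorch sketch of Eq. (1) follows; the kernel sizes and the stride-1 average pooling (so that both lines keep the same spatial size and the subtraction is well defined) are our assumptions.

```python
import torch
import torch.nn as nn

class DoubleLineBlock(nn.Module):
    """Double-line feature extraction (Eq. 1).

    The convolution keeps the channel count, and the average pooling
    uses stride 1 with padding, so X_i^conv and X_i^pool share one
    shape and can be subtracted element-wise.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        x_conv = self.conv(x)   # X_i^{conv}
        x_pool = self.pool(x)   # X_i^{pool} (smoothed feature)
        # pooling residual: the difference between the feature and its
        # smoothed version highlights locally salient responses
        return torch.cat([x_conv, x_conv - x_pool], dim=1)
```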

After double-line feature extraction, the features are merged with the deconvolution upsampling of the next block to form the final double-line output, which then enters the multi-scale feature fusion block. The concrete fusion operation is explained in the next section.

2.2 Multi-scale Feature Fusion

If we only kept high-level features, the output would lose many important details from the low-level feature floors, leading to poor saliency detection. Considering this, we use a multi-scale feature fusion structure to fuse high-level and low-level features until all feature floors are fused together. To unify the fusion sizes of the upper and lower layers, we apply deconvolution upsampling to the higher floor and then add it to the low-level double-line features. After fusing, we use a convolution layer to smooth the fusion results. In this way, our final output features incorporate both high-level and low-level semantic information. The concrete structure is shown in Fig. 3(b). In [11], the high-level features are directly concatenated to the lower-level features, which only retains multiple levels of information; our network further merges this information to achieve a complementary effect, so that the final output features focus both on large objects and on details and small objects, achieving stronger saliency detection.

Mathematically, the feature fusion operation sums, pixel by pixel, the deconvolved high-level feature maps and the double-line features of the current level, which is formulated as

$$\begin{aligned} X_i^{fus}= conv(\{X_i^{conv}, X_i^{conv}-X_i^{pool}\}+deconv(X_{i-1}^{fus})) \end{aligned}$$
(2)

where \(X_i^{fus}\) denotes the output features of the multi-scale feature fusion block corresponding to the \(i\)-th convolution block. Finally, the features incorporating information from all five convolutional floors are concatenated with the global features from the base VGG-16 network, and are then used in the softmax loss computation and the fully-connected CRF post-processing.
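The following PyTorch sketch illustrates Eq. (2) under our assumptions: the transposed convolution doubles the spatial resolution of the higher-level fused map and projects it to the channel count of the double-line output before the element-wise sum and the smoothing convolution.

```python
import torch.nn as nn

class FusionBlock(nn.Module):
    """Multi-scale feature fusion (Eq. 2).

    The higher-level fused map is upsampled 2x by a transposed
    convolution, added element-wise to the double-line output of the
    current level, then smoothed with a 3x3 convolution.
    """
    def __init__(self, channels, higher_channels):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(higher_channels, channels,
                                         kernel_size=4, stride=2, padding=1)
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, double_line_out, higher_fused):
        up = self.deconv(higher_fused)             # deconv of the higher floor
        return self.smooth(double_line_out + up)   # X_i^{fus}
```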

2.3 DMF in Occluded Person Re-id

The DMF network combines high-level and low-level semantic information well and uses the VGG-16 network, which can identify different object categories, as its basic network. Therefore, the DMF network achieves fine pedestrian saliency detection, and we use it to separate salient person body parts from backgrounds and noise caused by occlusions. After DMF saliency detection, occluded person images are cropped into partial person images according to the predicted salient regions. The process is shown in Algorithm 1.

Algorithm 1. Processing occluded person images into partial person images via DMF saliency detection
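Since the pseudo-code of Algorithm 1 is not reproduced here, the following is a plausible sketch of the cropping step it describes: threshold the predicted saliency map and crop the tight bounding box of the salient region. The threshold value and the bounding-box rule are our assumptions.

```python
import numpy as np

def crop_partial_person(image, saliency_mask, threshold=0.5):
    """Crop an occluded person image into a partial person image.

    image:         H x W x 3 uint8 array
    saliency_mask: H x W floats in [0, 1] predicted by DMF (+ CRF)
    """
    binary = saliency_mask >= threshold
    ys, xs = np.nonzero(binary)
    if ys.size == 0:                       # no salient body part found
        return image
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    return image[top:bottom + 1, left:right + 1]
```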

3 Experiment

3.1 Datasets

We evaluate the proposed method on five public benchmark datasets: MSRA-B [12], ECSSD [13], PASCAL-S [14], HKU-IS [15], and DUT-OMRON [16]. Besides, we apply our method to two occluded person re-id datasets, the Occluded REID dataset [17] and the Partial REID dataset [7], to verify its effectiveness in occluded person re-id.

MSRA-B has been widely used for salient object detection; it contains 5000 images with corresponding pixel-wise ground truth.

ECSSD contains 1000 natural images with complex structure acquired from the Internet.

PASCAL-S contains 850 natural images with pixel-wise saliency ground truth, chosen from the validation set of the PASCAL VOC 2010 segmentation dataset.

HKU-IS is a large-scale dataset containing 4447 images, split into 2500 training images, 500 validation images, and the remaining test images.

DUT-OMRON includes 5168 challenging images, each of which has one or more salient objects.

The Partial REID dataset is the first dataset for partial person re-id [7]; it includes 900 images of 60 people, with 5 full-body images, 5 partial images, and 5 occluded images per person.

The Occluded REID dataset consists of 2000 images of 200 persons, each with 5 full-body images and 5 occluded images with different types of severe occlusion. All images are taken from different viewpoints and backgrounds.

3.2 Experiment Setting

Methods for Comparison. To evaluate the superiority of our method, we compare it with several recent state-of-the-art methods: Geodesic Saliency (GS) [18], Manifold Ranking (MR) [16], optimized Weighted Contrast (wCtr*) [19], Background-based Single-layer Cellular Automata (BSCA) [20], Local Estimation and Global Search (LEGS) [21], Multi-Context (MC) [22], Multiscale Deep Features (MDF) [15], and Deep Contrast Learning (DCL) [23]. Among these, LEGS, MC, MDF, and DCL are recent saliency detection methods based on deep learning.

Evaluation Metrics. Max F-measure (\(F_\beta \)) and the mean absolute error (MAE) score are used to evaluate performance. Max \(F_\beta \) is computed from the PR curve and is defined as

$$\begin{aligned} F_\beta = \frac{(1+\beta ^{2})\times Precision\times Recall}{\beta ^{2}\times Precision + Recall} . \end{aligned}$$
(3)

The MAE score is the average pixel-wise absolute difference between the predicted mask P and its corresponding ground truth L, computed as

$$\begin{aligned} MAE = \frac{1}{W\times H}\sum _{x=1}^{W}\sum _{y=1}^{H}|\hat{P}(x,y)-\hat{L}(x,y)| . \end{aligned}$$
(4)

where \(\hat{P}\) and \(\hat{L}\) are the continuous saliency map and the ground truth normalized to [0, 1], and W and H are the width and height of the input image.
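The two metrics can be computed per image as in the sketch below (a per-image approximation; benchmark protocols usually aggregate precision and recall over the whole dataset before taking the maximum):

```python
import numpy as np

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Max F_beta (Eq. 3) over binarization thresholds of one saliency map.
    pred: H x W floats in [0, 1]; gt: H x W binary ground truth."""
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt > 0.5).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / ((gt > 0.5).sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best

def mae(pred, gt):
    """Mean absolute error (Eq. 4) between normalized map and ground truth."""
    return np.abs(pred.astype(float) - gt.astype(float)).mean()
```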

Parameter Setting. Our method is implemented in PyTorch and initialized with the pretrained weights of VGG-16 [9]. We randomly take 2500 images of the MSRA-B dataset as training data and select 2000 of the remaining images as testing data; the other datasets are used entirely for testing. All input images are resized to \(352 \times 352\) for training and testing. The experiments are conducted with an initial learning rate of \(10^{-6}\), a batch size of 40, and the evaluation-metric parameter \(\beta ^{2}\) set to 0.3. The parameters of the fully-connected CRF follow [8].
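A minimal training-setup sketch consistent with these hyper-parameters is shown below; the optimizer choice (Adam) and the `DMFNetwork`/`msra_b_train` names are our assumptions and purely illustrative.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

# Hypothetical names: DMFNetwork is the model sketched in Sect. 2 and
# msra_b_train yields (image, mask) pairs, with images resized to
# 352 x 352 and masks binary. The paper specifies only the learning
# rate, batch size, and input size.
model = DMFNetwork().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
loader = DataLoader(msra_b_train, batch_size=40, shuffle=True)

for images, masks in loader:
    logits = model(images.cuda())                        # (N, 2, H, W) scores
    loss = F.cross_entropy(logits, masks.cuda().long())  # softmax loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```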

3.3 Experiment Results

Table 1. Comparison with state-of-the-art methods on five public benchmark datasets

Compared with State-of-the-Art. We compare our approach with several recent state-of-the-art methods in terms of max \(F_\beta \) and MAE score on five benchmark datasets, as shown in Table 1. We collect eight methods, including (A) non-deep-learning ones and (B) deep learning ones. Our method shows the best overall performance and largely outperforms the non-deep-learning methods, because a deep neural network can learn and update the model automatically. Besides, our method surpasses the \(2^{nd}\) best method on ECSSD, PASCAL-S, HKU-IS, and DUT-OMRON in terms of both max \(F_\beta \) and MAE score in almost all cases, which indicates that our model generalizes well and can be directly applied in practice.

Used on Occluded Person Re-identification. We visually compare our approach with the methods in Table 1 on two occluded person re-id datasets, the Occluded REID dataset and the Partial REID dataset, and process occluded person images into partial person images according to Algorithm 1 (Sect. 2.3). The results are shown in Fig. 4: our proposed method not only highlights the most relevant regions, i.e., the person body parts, but also finds exact boundaries to obtain better partial person images. Therefore, our method is well suited to pedestrian saliency detection in occluded person re-id.

3.4 Time Costing

We measure the speed of deep learning saliency detection methods by the average time needed to obtain a saliency map for one image. Table 2 compares five deep learning based methods, LEGS [21], MC [22], MDF [15], DCL [23], and ours, using a Titan GPU. Our method takes the least time and is 4 to 50 times faster than the other methods, which illustrates its superiority in computing speed.

Table 2. Time cost of obtaining a saliency map
Fig. 4. Visual comparison with eight existing methods and examples of cropping with the predicted masks to obtain partial person images. As can be seen, our proposal produces more accurate and coherent saliency maps than all other methods.

4 Conclusion

In this paper, we make the first attempt to deal with occluded person images in occluded person re-id via pedestrian saliency detection. To detect person body parts finely, we propose the double-line multi-scale fusion (DMF) network, which obtains more semantic information through double-line feature extraction and multi-scale fusion, fusing high-level and low-level information from the high floors to the low floors. We finally use a fully-connected CRF as a post-processing step after the DMF network. Experimental results on saliency detection benchmarks and on occluded person re-id datasets both show the effectiveness and superiority of our method.

This project is supported by the Natural Science Foundation of China (61573387) and Guangdong Project (2017B030306018).