Abstract
Person re-identification is challenging due to large variations of pose, illumination, occlusion and camera view. Owing to these variations, the pedestrian data is distributed as highly-curved manifolds in the feature space, despite the capability of convolutional neural networks (CNN) for feature extraction. However, the distribution is unknown, so it is difficult to use the geodesic distance when comparing two samples. In practice, current deep embedding methods use the Euclidean distance for both training and test. On the other hand, manifold learning methods suggest using the Euclidean distance in the local range, combined with the graphical relationship between samples, to approximate the geodesic distance. From this point of view, selecting suitable positive (i.e. intra-class) training samples within a local range is critical for training the CNN embedding, especially when the data has large intra-class variations. In this paper, we propose a novel moderate positive sample mining method to train a robust CNN for person re-identification, dealing with the problem of large variation. In addition, we improve the learning by a metric weight constraint, so that the learned metric has a better generalization ability. Experiments show that these two strategies are effective in learning robust deep metrics for person re-identification, and accordingly our deep model significantly outperforms the state-of-the-art methods on several person re-identification benchmarks. Therefore, the study presented in this paper may be useful in inspiring new designs of deep models for person re-identification.
1 Introduction
Given a set of pedestrian images, person re-identification aims to match a probe image against gallery images that are generally captured by different cameras. Nowadays, person re-identification is becoming increasingly important for surveillance and security systems, e.g. replacing manual video screening and other heavy loads. Person re-identification is a challenging task due to large variations of body pose, lighting, view angles and scenarios across time and cameras.
The framework of existing methods usually consists of two parts: (1) extracting discriminative features from pedestrian images; (2) computing the distance between samples by feature comparison. Many works focus on these two aspects. The traditional methods work on designing suitable hand-crafted features [34, 41], learning a good metric for comparison [13, 15, 18, 23, 26, 36, 38, 39], or both [12, 19, 33, 37]. The first aspect seeks features that are robust to challenging factors (lighting, pose, etc.) while preserving the identity information. The second aspect leads to the metric learning problem, which generally minimizes the intra-class distance while maximizing the inter-class distance.
More recently, deep learning methods have gradually gained popularity in person re-identification. The re-identification methods based on deep learning [1, 6, 17, 35] incorporate the two above-mentioned aspects (feature extraction and metric learning) into an integrated framework. Feature extraction and metric learning are fulfilled respectively by two components in a deep neural network: the CNN part, which extracts features from images, and the following metric learning part, which compares the features with the metric. The FPNN [17] algorithm was the first to introduce a patch matching layer into the CNN part. Ahmed et al. [1] proposed an improved deep learning architecture (IDLA) with cross-input neighborhood differences and patch summary features. These two methods are both dedicated to improving the CNN architecture. Their purpose is to evaluate the pair similarity early in the CNN stage, so that they can make use of the spatial correspondence of feature maps. As for the metric learning part, DML [35] adopted the cosine similarity and Binomial deviance. DeepFeature [6] adopted the Euclidean distance and the triplet loss. Some others [1, 17] used the logistic loss to directly form a binary classification problem of whether the input image pair belongs to the same identity.
The following are our contributions.

- For training the CNN, the hard negative mining strategy has been used in [1, 27, 30]. Considering the large intra-class variations in pedestrian data, we argue that, in person re-identification, the positive training pairs should also be sampled carefully, since the pedestrian data is distributed as manifolds that are highly curved in the feature space. As argued in some manifold learning methods [4, 29, 31], it is effective to use the local Euclidean distance, combined with the graphical relationship between samples, to approximate the geodesic distance. Thus, selecting the moderate positive pairs in the local range is critical for training the network. This is an important issue that has seldom been noticed. In this paper, we propose a new training strategy, named moderate positive mining (Footnote 1), to adaptively search for moderate positives during training. This novel training method significantly improves the identification accuracy.

- In addition, we improve the network by a weight constraint for the metric layers. The weight constraint regularizes the metric learning part and alleviates the over-fitting problem.
2 Related Work
Positive Sample Mining. The hard negative mining strategy [30] has been used for face recognition. In person re-identification, IDLA [1] also adopted hard negative mining for training. By forcing the model to focus on the hard negatives near the decision boundary, hard negative mining improves the training efficiency and the model performance. In this paper, we find that how to select moderate positive samples is also an essential issue for learning a person re-identification model. The moderate positives are as critical as the hard negatives for training the network, especially when the data has large intra-class variations. However, there has been barely any previous attempt in this direction for learning deep embeddings. In our approach, we propose the novel strategy of moderate positive mining to address this problem. We sample the moderate positives for training, and avoid using the hard ones caused by extreme intra-class variations of pedestrian data. We empirically find that this strategy effectively improves the identification accuracy (see Sect. 4.2).
Weight Constraint for Metric Learning. A commonly used metric in deep learning methods is the Euclidean distance [6, 27, 30]. However, the Euclidean distance is sensitive to the scale, and is blind to the correlation across dimensions. In practice, we cannot guarantee that the CNN-learned features have similar scales or are de-correlated across dimensions. Therefore, the Mahalanobis distance is a better choice for a multivariate metric [22]. In the area of face recognition, DDML [11] implemented the Mahalanobis metric in their network, but without any constraint. Our metric is learned in a similar way and improved by the proposed weight constraint, which helps to gain better generalization ability.
3 Proposed Method
In this section, we first introduce the moderate positive mining method. Then, we revisit DDML and introduce the weight constraint.
3.1 Moderate Positive Mining
Large Intra-class Variations. Many factors lead to the large intra-class variations in pedestrian data, such as illumination, background, misalignment, occlusion, co-occurrence of people, appearance changes, etc. Many of them are specific to pedestrian data. Figure 1(a) shows some hard positive cases in the CUHK03 data set [17]. Some of them are difficult even for humans to recognize.
Although CNNs have a strong ability to extract features, pedestrian data follows a very irregular distribution in the feature space due to the large variations, such as the example of a highly-curved manifold illustrated in Fig. 1(b). This is reflected by the fact that the state-of-the-art performances on several person re-identification benchmarks are relatively poor compared with the human face recognition task, which is easier due to smaller intra-class variations.
Moderate Positive Mining Method. Since the distribution in Fig. 1(b) is unknown, it is difficult to apply the geodesic distance for comparing two samples. The usual way is to use the Mahalanobis distance (or its special case, the Euclidean distance) [6, 11, 30], which is a suitable metric in the ideal condition (Fig. 1(c)).
On the other hand, the manifold learning methods [4, 29, 31] suggest using the Euclidean distance (or heat kernel) in the local range, combined with the graphical relationship between samples, to approximate the geodesic distance. This is a feasible way to minimize the intra-class variance along the manifold for supervised learning. However, when training a deep CNN with the contrastive or triplet loss for embedding, the existing deep embedding methods use the Euclidean distance indiscriminately over all the positive samples.
Here, we argue that selecting positive samples in the local range (paired by the yellow line in Fig. 1(b)) is critical for training the network; training with positive samples of large distance (the yellow line with a cross) may distort the manifold and harm the manifold learning.
The basic idea is to reduce the intra-class variance while preserving the intrinsic graphical structure of pedestrian data, via mining the moderate positive pairs in the local range.
We introduce the moderate positive mining method as follows: we select the moderate positive pairs within the range of one subject at a time. For example, suppose a subject has 6 images, 3 from one camera and 3 from another. We can form 9 positive pairs in total from this subject. If we always use the easiest of the nine positive pairs, the convergence will be very slow; if we use the hardest, the learning will be damaged. Thus, we pick the moderate positive pairs that lie between the two extreme cases.
Let \(\mathcal {I}_\mathbf {1}\) and \(\mathcal {I}_\mathbf {2}\) be two sets of pedestrian images from two disjoint cameras. Denote \(\mathbf {I}_1 \in \mathcal {I}_\mathbf {1}\) and \(\mathbf {I}_2^p \in \mathcal {I}_\mathbf {2}\) as a positive pair (from the same identity), and \(\mathbf {I}_1 \in \mathcal {I}_\mathbf {1}\) and \(\mathbf {I}_2^n \in \mathcal {I}_\mathbf {2}\) as a negative pair (from different identities). Denote \(\mathbf {\Psi } (\cdot )\) as the CNN and \( d(\cdot , \cdot ) \) as the Mahalanobis or Euclidean distance. The mining method is described as follows:
First, we randomly select an anchor sample together with its positive samples and negative samples (in equal number) to form a mini-batch; then, we mine the hardest negative sample, and choose the positive samples that have smaller distances than the hardest negative; finally, we mine the hardest one among these chosen positives as our moderate positive sample. The reason for doing so is that we define the "moderate positive" adaptively within each subject, while the hard negatives are also involved in case the positives are too easy or too hard to mine.
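The selection rule can be sketched as follows; this is a minimal illustration assuming features are already extracted, and the fallback to the easiest positive when no positive beats the hardest negative is our assumption (the text above specifies only the main rule):

```python
import numpy as np

def mine_moderate_positive(anchor, positives, negatives, dist):
    """Moderate positive mining (a sketch).

    anchor:    (d,) feature of the anchor sample
    positives: (P, d) features of same-identity samples
    negatives: (N, d) features of different-identity samples
    dist:      callable implementing the current metric d(., .)
    """
    pos_d = np.array([dist(anchor, p) for p in positives])
    neg_d = np.array([dist(anchor, n) for n in negatives])

    hardest_neg = neg_d.min()                      # hardest negative: smallest distance
    candidates = np.where(pos_d < hardest_neg)[0]  # positives closer than it

    if len(candidates) == 0:
        # Fallback (our assumption): no positive is closer than the
        # hardest negative, so take the easiest (closest) positive.
        return positives[pos_d.argmin()]

    # Moderate positive: the hardest among the qualifying candidates.
    return positives[candidates[pos_d[candidates].argmax()]]
```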
An example is given in Fig. 2. In the experiments, this dynamic mining strategy improves the performance significantly and shows good stability, since all the positives are considered within each subject and the data is augmented by random translation.
3.2 Weight Constraint for Deep Metric Learning
Once the CNN extracts the features from a pair of images, the metric layers compute the distance, as shown in Fig. 3. The metric learning layer follows the structure proposed in DDML [11], and its learning is improved via a weight constraint.
Recalling the two sets of pedestrian images \(\mathcal {I}_\mathbf {1}\) and \(\mathcal {I}_\mathbf {2}\) mentioned above, denote \(\mathcal {X}_\mathbf {1}\) and \(\mathcal {X}_\mathbf {2}\) as the corresponding feature sets extracted by the CNN; \(\mathbf {x}_1 = \mathbf {\Psi }(\mathbf {I}_1)\), \(\mathbf {x}_2^p = \mathbf {\Psi }(\mathbf {I}_2^p)\) and \(\mathbf {x}_2^n = \mathbf {\Psi }(\mathbf {I}_2^n)\) are the corresponding features of the anchor, positive and negative samples.
Revisiting DDML. The Mahalanobis distance is formulated as

\(d(\mathbf {x}_1, \mathbf {x}_2) = \sqrt{(\mathbf {x}_1 - \mathbf {x}_2)^T \mathbf {M} (\mathbf {x}_1 - \mathbf {x}_2)},\)   (1)
where \(\mathbf {x}_2 \in \{\mathbf {x}_2^p,\mathbf {x}_2^n\}\) and \(\mathbf {M}\) is a symmetric positive semi-definite matrix. Learning \(\mathbf {M}\) under the positive semi-definite constraint is difficult, so we make use of the decomposition \(\mathbf {M} = \mathbf {W}\mathbf {W}^T\). Learning \(\mathbf {W}\) is much easier, and \(\mathbf {W}\mathbf {W}^T\) is always positive semi-definite. We develop the distance as follows:

\(d(\mathbf {x}_1, \mathbf {x}_2) = \sqrt{(\mathbf {x}_1 - \mathbf {x}_2)^T \mathbf {W}\mathbf {W}^T (\mathbf {x}_1 - \mathbf {x}_2)} = \Vert \mathbf {W}^T (\mathbf {x}_1 - \mathbf {x}_2) \Vert _2.\)   (2)
The inner product \(\mathbf {W}^T(\mathbf {x}_1 - \mathbf {x}_2)\) can be implemented by a linear fully-connected (FC) layer in which the weight matrix is defined by \(\mathbf {W}^T\). The output of the FC layer is calculated by

\(f\left( \mathbf {W}^T(\mathbf {x}_1 - \mathbf {x}_2) + \mathbf {b}\right) ,\)   (3)
where \(\mathbf {b}\) is the bias term. The identity function is used as the activation \(f(\cdot )\) for the linear FC layer. As shown in Fig. 3, the feature vectors \(\mathbf {x}_1\) and \(\mathbf {x}_2\) are fed into the subtraction layer. Then, the difference is transformed by the linear FC layer with the weight matrix \(\mathbf {W}^T\). For the symmetry of the distance, we fix the bias term \(\mathbf {b}\) to zero throughout training and testing. Finally, the L2 norm is computed as the output distance \(d(\mathbf {x}_1, \mathbf {x}_2)\). This structure remains equivalent when the positions of the subtraction layer and the FC layer are switched.
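To make the structure concrete, the metric layers can be written as a small module. This is a minimal sketch assuming a PyTorch-style framing (the class name and dimension arguments are ours), not the original implementation:

```python
import torch
import torch.nn as nn

class MahalanobisMetric(nn.Module):
    """Metric layers of Fig. 3 (a sketch): subtraction, a linear FC layer
    with weight W^T and zero bias, then the L2 norm as the distance."""

    def __init__(self, feat_dim, out_dim):
        super().__init__()
        # bias=False keeps d(x1, x2) symmetric, as required in the text above
        self.fc = nn.Linear(feat_dim, out_dim, bias=False)

    def forward(self, x1, x2):
        diff = x1 - x2           # subtraction layer
        proj = self.fc(diff)     # W^T (x1 - x2), identity activation
        return proj.norm(dim=1)  # d(x1, x2) = ||W^T (x1 - x2)||_2

    def metric_matrix(self):
        # M = W W^T; nn.Linear stores W^T as .weight (out_dim x feat_dim)
        w = self.fc.weight.t()   # (feat_dim x out_dim) = W
        return w @ w.t()
```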
Weight Constraint. The objective is to minimize the intra-class distance and maximize the inter-class distance. The training loss is defined as

\(L = d^2(\mathbf {I}_1, \mathbf {I}_2^p) + \max \left( m - d^2(\mathbf {I}_1, \mathbf {I}_2^n),\, 0 \right) ,\)   (4)
where \(\mathbf {I}_1\), \(\mathbf {I}_2^p\) and \(\mathbf {I}_2^n\) are the input images corresponding to the features \(\mathbf {x}_1\), \(\mathbf {x}_2^p\) and \(\mathbf {x}_2^n\), \(d(\mathbf {I}, \mathbf {I}')\) is shorthand for the distance between their CNN features, and m is the margin, which is set to 2 in our implementation. In each forward propagation, either the first term or the second term of Eq. 4 is computed. Then the loss is obtained by combining the two terms, and we compute the gradient.
Compared with the Mahalanobis distance, the Euclidean distance has less discriminability but better generalization ability, because it does not take account of the scales and correlations across dimensions [22]. Here, we impose a constraint that keeps the matrix \(\mathbf {M}\) with large values on the diagonal and small entries elsewhere, so that we achieve a balance between the unconstrained Mahalanobis distance and the Euclidean distance. The constraint is formulated with the Frobenius norm of the difference between \(\mathbf {W}\mathbf {W}^T\) and the identity matrix \(\mathbf {I}\):

\(\Vert \mathbf {W}\mathbf {W}^T - \mathbf {I} \Vert _F \le C,\)   (5)
where \(C \) is a constant. We further combine the constraint into the loss function as a regularization term:

\(\hat{L} = L + \lambda \Vert \mathbf {W}\mathbf {W}^T - \mathbf {I} \Vert _F^2,\)   (6)
where \(\lambda \) is the relative weight of the regularization and \(\hat{L}\) is the new loss function. For updating the weight matrix \(\mathbf {W}\), the gradient w.r.t. \(\mathbf {W}\) is computed by

\(\frac{\partial \hat{L}}{\partial \mathbf {W}} = \frac{\partial L}{\partial \mathbf {W}} + 4\lambda \left( \mathbf {W}\mathbf {W}^T - \mathbf {I} \right) \mathbf {W}.\)   (7)
When \(\lambda \) is small, the Mahalanobis distance takes into account the correlations across dimensions. However, it may over-fit the training set, since the metric matrix (i.e. \(\mathbf {W}\mathbf {W}^T\)) is learned from the training set, which is usually small in person re-identification. On the other hand, when \(\lambda \) is large, the matrix \(\mathbf {W}\mathbf {W}^T\) becomes close to the identity matrix. In the extreme case, \(\mathbf {W}\mathbf {W}^T\) equals the identity matrix, and the distance reduces to the Euclidean distance, which does not consider the correlations but may generalize more robustly to unseen test sets. In this way, we incorporate the advantages of the Mahalanobis and Euclidean distances, and balance the matching accuracy and generalization performance via the constraint.
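A minimal sketch of the regularized loss, assuming the squared Frobenius penalty of Eq. 6 and a PyTorch framing (the function name and argument layout are ours):

```python
import torch

def regularized_loss(d_pos, d_neg, W, lam=1e-2, m=2.0):
    """Eq. 4 plus the weight constraint of Eq. 6 (a sketch).

    d_pos, d_neg: distances d(x1, x2^p) and d(x1, x2^n) as tensors
    W:            metric weight matrix (feat_dim x out_dim), e.g.
                  metric.fc.weight.t() from the module sketched above
    """
    # Contrastive terms: pull positives together, push negatives beyond m.
    contrastive = (d_pos ** 2).sum() + torch.clamp(m - d_neg ** 2, min=0).sum()
    # lam * ||W W^T - I||_F^2; autograd yields the 4*lam*(W W^T - I) W
    # gradient of Eq. 7 automatically.
    eye = torch.eye(W.shape[0], device=W.device, dtype=W.dtype)
    constraint = ((W @ W.t()) - eye).pow(2).sum()
    return contrastive + lam * constraint
```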
4 Experiments
Our method is implemented by modifying the CUDA-Convnet [14] framework. We report evaluations with the one-shot standard protocol on three common person re-identification benchmarks, i.e. CUHK03 [17], CUHK01 [16] and VIPeR [9].
We begin with a description of the CNN architecture we use for extracting features. Then we report the evaluation on the validation set of CUHK03, analyzing the effects of moderate positive mining (Sect. 4.2), the weight constraint (Sect. 4.3), and the CNN architecture (Sect. 4.4). Next, we compare our performance with the state-of-the-art methods on CUHK03 and CUHK01 (Sects. 4.5 and 4.6). Finally, we show that the proposed method also performs well on the small VIPeR data set [9] and gains competitive results (Sect. 4.7).
4.1 CNN Architecture
The CNN is built from 3 branches, with the details shown in Fig. 4. The input image is normalized to \(128\times 64\) RGB. Then, it is split into three \(64\times 64\) overlapping color patches, each of which is handled by one branch. Each branch consists of 3 convolutional layers and 2 pooling layers. No parameter sharing is performed between branches. The 3 branches are then merged by an FC layer with the ReLU activation. Finally, the output feature vector \(\mathbf {x}\) is computed by another FC layer with linear activation. For computational stability, the features are normalized before being sent to the metric learning layers. The CNN and the metric layers are learned jointly via backpropagation.
Our network has much lighter weights (0.84M parameters) than the previous best methods on CUHK03 and CUHK01 (IDLA [1], 2.32M) and VIPeR (DeepFeature [6], 26M). The reason we build the CNN architecture in branches is to learn specific features from the different body parts of the pedestrian image; meanwhile, the morphological information of each body part is preserved. DML [35] adopted a similar architecture but with tied weights between branches. In Sect. 4.4, the experiments show the advantage of our untied architecture, and a rough sketch of the branch layout is given below.
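In the sketch, the channel counts, kernel sizes and patch offsets are illustrative assumptions (Fig. 4 gives the actual configuration), and the PyTorch framing differs from the original CUDA-Convnet implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_branch():
    # 3 convolutional layers and 2 pooling layers per branch; channel
    # counts and kernel sizes here are illustrative assumptions.
    return nn.Sequential(
        nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    )

class ThreeBranchCNN(nn.Module):
    """Untied three-branch CNN of Fig. 4 (a sketch under assumed sizes)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # No parameter sharing: three independent branches.
        self.branches = nn.ModuleList([make_branch() for _ in range(3)])
        self.fc1 = nn.Linear(3 * 32 * 16 * 16, 512)  # merge branches (ReLU)
        self.fc2 = nn.Linear(512, feat_dim)          # linear output feature

    def forward(self, img):  # img: (B, 3, 128, 64)
        # Three overlapping 64x64 patches along the vertical axis
        # (the exact offsets are our assumption).
        patches = [img[:, :, 0:64], img[:, :, 32:96], img[:, :, 64:128]]
        feats = [b(p).flatten(1) for b, p in zip(self.branches, patches)]
        x = F.relu(self.fc1(torch.cat(feats, dim=1)))
        x = self.fc2(x)
        return F.normalize(x, dim=1)  # normalize for computational stability
```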
4.2 Analysis of Moderate Positive Mining
CUHK03 contains 1,369 subjects, each of which has around 10 images. The default protocol randomly selects 1,169 subjects for training, 100 for validation, and 100 for test. We pre-train the CNN with a softmax classification on the training set as the baseline. The outputs of softmax correspond to the identities.
To demonstrate the advantage of moderate positive mining, we compare the performances on the validation set with and without moderate positive mining. The cumulative matching characteristic (CMC) curves and the rank-1 identification rates are shown in Fig. 5(a). We find that the collaboration of moderate positive mining and hard negative mining achieves the best result (red line). The absence of moderate positive mining leads to a significant degradation of performance (blue). This reflects that the manifold is badly learned if all the positive pairs are used indiscriminately.
(a) Performance analysis of moderate positive mining. Red: both moderate positive mining and hard negative mining are employed. Blue: only hard negative mining is employed. Magenta: no mining technique is employed during training. Black: the softmax baseline. (b) The loss curves along training iterations. Black: training set. Red: validation set. (c) Some positives mined by the moderate positive mining method. (Color figure online)
If neither of the two mining methods is used (magenta), the network gives a very low identification rate at low ranks, even worse than the baseline (black). This indicates that moderate positive mining and hard negative mining are both crucial for training.
The CMC curves of the 3 trained networks tend to saturate after the rank exceeds 20, whereas the baseline network remains at a relatively low identification rate. This indicates that the training with the metric layers is the basic contributor to the improvement.
The training of the network converges well, as the loss value descends with respect to the iterations (shown in Fig. 5(b)). Some positives mined by moderate positive mining during training are shown in Fig. 5(c). These positives are of moderate difficulty compared with the hard ones in Fig. 1(a).
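As a reference for how curves like those in Fig. 5(a) are obtained, the sketch below computes a CMC curve from a probe-gallery distance matrix under the one-shot protocol (the function name is ours):

```python
import numpy as np

def cmc(dist, probe_ids, gallery_ids, max_rank=20):
    """CMC curve (a sketch): fraction of probes whose true match appears
    within the top-k ranked gallery entries, for k = 1..max_rank.

    dist: (n_probe, n_gallery) distance matrix; one-shot protocol assumed,
    i.e. exactly one gallery image per probe identity.
    """
    order = dist.argsort(axis=1)             # gallery sorted by distance
    ranked = np.asarray(gallery_ids)[order]  # identities in ranked order
    match = ranked == np.asarray(probe_ids)[:, None]
    first_hit = match.argmax(axis=1)         # rank index of the true match
    return np.array([(first_hit < k).mean() for k in range(1, max_rank + 1)])
```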
4.3 Analysis of Weight Constraint
We inspect the metric matrices learned with different relative weights (\(\lambda \)) of the regularization. In Fig. 6(a), we show the spectra of the matrix \(\mathbf {M}\). We also show the corresponding rank-1 identification rates in Fig. 6(b).

When \(\lambda = 10^2\), the singular values are almost constant at 1, which means the metric layers almost give the Euclidean distance. This leads to low variance and high bias. As \(\lambda \) decreases, the singular values vary more across dimensions. This implies that the learned metric fits the training data better, but is more likely to over-fit. Therefore, a moderate value of \(\lambda \) gives a trade-off between variance and bias, which is the appropriate choice for good performance (Fig. 6(b)).
4.4 Analysis of Untied Branches
We show the learned filters of the untied branches in Fig. 7(a). The network has learned remarkable color representations, which is coherent with the results of IDLA [1]. Since we apply untied weights between branches, each branch learns different filters from its own part. As shown in Fig. 7(a), where each row demonstrates a filter set from one branch, we can find that each branch has its own emphasis in color. For example, the middle branch inclines to violet and blue, whereas the bottom branch has learned filters of obviously lighter colors than the other two. The reason is that pedestrian images have a regular appearance of the human body, and each part has its own color distribution. Therefore, the branches learn part-specific filters, and the morphological information is taken into account in the features.
We compare the performances with and without tied weights between branches in Fig. 7(b). We augment the filter number in the tied-branch network so as to make its parameter number roughly equal to that of the untied one. The untied-branch network gains a better performance than that with tied branches. This reflects that, when the network has a certain complexity, the neural structure (i.e. tied vs. untied) becomes very important; how to organize the network structure is a critical issue for good performance.
4.5 Performance on CUHK03
We adopt random translation for training data augmentation. The images are randomly cropped (0–5 pixels) horizontally and vertically, and stretched back to the original size. According to the validation results (Sect. 4.3), we set the parameter \(\lambda = 10^{-2}\) in all the following experiments. Both moderate positive mining and hard negative mining are employed.
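A minimal sketch of this augmentation, assuming PIL images (the function name and the random placement of the crop window within the image are our assumptions):

```python
import numpy as np
from PIL import Image

def random_translate(img, max_shift=5):
    """Random-translation augmentation (a sketch): crop 0-5 pixels
    horizontally and vertically, then stretch back to the original size."""
    w, h = img.size
    dx, dy = np.random.randint(0, max_shift + 1, size=2)  # pixels to remove
    # Random placement of the cropped window within the image (assumption).
    x0 = np.random.randint(0, dx + 1)
    y0 = np.random.randint(0, dy + 1)
    cropped = img.crop((x0, y0, w - (dx - x0), h - (dy - y0)))
    return cropped.resize((w, h), Image.BILINEAR)  # stretch to recover size
```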
CUHK03 has 2 versions: one with manually labeled images, and the other with detected images. We evaluate our method on the test sets of both versions, and compare our performance with both traditional and deep learning methods. The traditional methods include LOMO-XQDA [19], KISSME [13], LDM [10], RANK [24], eSDC [40], SDALF [7], LMNN [32], ITML [5] and Euclid [40]. The deep learning methods include FPNN [17] and IDLA [1]. IDLA and LOMO-XQDA previously achieved the best performance on CUHK03. The CMC curves and the rank-1 identification rates are shown in Fig. 8. Our method achieves better performance than the previous state-of-the-art methods on not only the labeled version but also the detected version. This indicates that our method is robust to the misalignment caused by detection.
4.6 Performance on CUHK01
The CUHK01 data set contains 971 subjects, each of which has 4 images under 2 camera views. Following the protocol in [16], the data set is divided into a training set of 871 subjects and a test set of 100. We train the network on CUHK03 and further fine-tune it on CUHK01, using the same setting as the state-of-the-art method IDLA [1]. We compare our approach with the previously mentioned methods. The CMC curves and rank-1 identification rates are shown in Fig. 9(a). Our approach gains the best result (the red line) with a 69 % rank-1 identification rate.
Besides, to inspect the limitation of the CUHK01 data set, we add the recently released Market1501 [42] to the training. As the training data increases, our network gives a better performance (the red dashed line marked as "Ours *") with an 87 % rank-1 identification rate. We show some failed cases in Fig. 10. In each block, we give the true gallery image, the probe and the false positive image from left to right. We find that most failed cases come from dark images or negative pairs with significant color correspondence. This phenomenon is in line with the finding [1] that the learned filters mainly focus on image colors (as shown in Fig. 7(a)). The re-identification problem becomes extremely difficult when the true positive pairs have inconsistent colors across views while the negative pairs have similar colors (due to lighting, camera settings, etc.).
4.7 Performance on VIPeR
The VIPeR [9] data set includes 632 subjects, each of which has 2 images from two different cameras. Although VIPeR is a small data set that is not suitable for training a CNN, we are still interested in the performance on this challenging task. The data set is randomly split into two subsets of non-overlapping subjects with equal size, one for training and one for testing. We fine-tune the network on the 316-person training set and test it on the test set. We also adopt random translation for training data augmentation. The results are shown in Fig. 9(b). We compare our model with IDLA [1], DeepFeature [6], visual word (visWord) [37], saliency matching (SalMatch) and patch matching (PatMatch) [39], ELF [8], PRSVM [3], LMNNR [2], eBiCov [21], local Fisher discriminant analysis (LF) [28], PRDC [43], aPRDC [20], PCCA [25], mid-level filters (mFilter) [41] and the fusion of mFilter and LADF [18]. Our approach achieves an identification rate of 40.91 % at rank 1, which is the best result on VIPeR among the existing deep learning methods. Note that the highest rank-1 identification rate (43.39 %) is obtained by a combination of two methods (mFilter+LADF) [18]. The identification rate of DeepFeature [6] is close to ours at rank 1, but much lower at higher ranks.
5 Conclusion
The large variations of pedestrian data are a challenging point for person re-identification methods. Although CNNs have a strong ability to extract features, pedestrian data follows a very irregular distribution in the feature space due to the large variations. In order to cope with this problem and train a robust deep embedding, the positive training samples should be selected deliberately. In this paper, we propose a novel moderate positive mining method to learn a robust deep metric for person re-identification. We find that mining the moderate positive samples is crucial for training deep networks, especially for difficult data with large intra-class variations (e.g. pedestrians). The moderate positive mining method dynamically selects suitable positive pairs for learning a robust embedding adaptive to the data manifold. Moreover, we propose the weight constraint for gaining robustness to the over-fitting problem in person re-identification.
Owing to these improvements, our method achieves state-of-the-art performance on CUHK03 and CUHK01, and competitive results on VIPeR. By mining the moderate positive samples for training, we reduce the intra-class variance while preserving the intrinsic graphical structure of pedestrian data; the metric weight constraint helps to improve the generalization ability of the network, especially since most of the parameters are in the metric layers.
Notes
1. The source code is available at http://www.cbsr.ia.ac.cn/users/hailinshi.
References
1. Ahmed, E., Jones, M., Marks, T.K.: An improved deep learning architecture for person re-identification. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
2. Bak, S., Corvee, E., Bremond, F., Thonnat, M.: Multiple-shot human re-identification by mean Riemannian covariance grid. In: 2011 8th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 179–184. IEEE (2011)
3. Bazzani, L., Cristani, M., Perina, A., Murino, V.: Multiple-shot person re-identification by chromatic and epitomic analyses. Pattern Recognit. Lett. 33(7), 898–903 (2012)
4. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003)
5. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: Proceedings of the 24th International Conference on Machine Learning, pp. 209–216. ACM (2007)
6. Ding, S., Lin, L., Wang, G., Chao, H.: Deep feature learning with relative distance comparison for person re-identification. Pattern Recognit. 48(10), 2993–3003 (2015)
7. Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-identification by symmetry-driven accumulation of local features. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2360–2367. IEEE (2010)
8. Gheissari, N., Sebastian, T.B., Hartley, R.: Person reidentification using spatiotemporal appearance. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1528–1535. IEEE (2006)
9. Gray, D., Brennan, S., Tao, H.: Evaluating appearance models for recognition, reacquisition, and tracking. In: Proceedings of IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), vol. 3. Citeseer (2007)
10. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 498–505. IEEE (2009)
11. Hu, J., Lu, J., Tan, Y.P.: Discriminative deep metric learning for face verification in the wild. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1875–1882. IEEE (2014)
12. Khamis, S., Kuo, C.-H., Singh, V.K., Shet, V.D., Davis, L.S.: Joint learning for attribute-consistent person re-identification. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 134–146. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16199-0_10
13. Koestinger, M., Hirzer, M., Wohlhart, P., Roth, P.M., Bischof, H.: Large scale metric learning from equivalence constraints. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2288–2295. IEEE (2012)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
15. Li, W., Wang, X.: Locally aligned feature transforms across views. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3594–3601. IEEE (2013)
16. Li, W., Zhao, R., Wang, X.: Human reidentification with transferred metric learning. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012, Part I. LNCS, vol. 7724, pp. 31–44. Springer, Heidelberg (2013)
17. Li, W., Zhao, R., Xiao, T., Wang, X.: DeepReID: deep filter pairing neural network for person re-identification. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 152–159. IEEE (2014)
18. Li, Z., Chang, S., Liang, F., Huang, T.S., Cao, L., Smith, J.R.: Learning locally-adaptive decision functions for person verification. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3610–3617. IEEE (2013)
19. Liao, S., Hu, Y., Zhu, X., Li, S.Z.: Person re-identification by local maximal occurrence representation and metric learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2197–2206 (2015)
20. Liu, C., Gong, S., Loy, C.C., Lin, X.: Person re-identification: what features are important? In: Fusiello, A., Murino, V., Cucchiara, R. (eds.) ECCV 2012. LNCS, vol. 7583, pp. 391–401. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33863-2_39
21. Ma, B., Su, Y., Jurie, F.: BiCov: a novel image representation for person re-identification and face verification. In: British Machine Vision Conference (2012)
22. Manly, B.F.: Multivariate Statistical Methods: A Primer. CRC Press, Boca Raton (2004)
23. Martinel, N., Micheloni, C., Foresti, G.L.: Saliency weighted features for person re-identification. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 191–208. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16199-0_14
24. McFee, B., Lanckriet, G.R.: Metric learning to rank. In: Proceedings of the 27th International Conference on Machine Learning (ICML-2010), pp. 775–782 (2010)
25. Mignon, A., Jurie, F.: PCCA: a new approach for distance learning from sparse pairwise constraints. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2666–2672. IEEE (2012)
26. Paisitkriangkrai, S., Shen, C., van den Hengel, A.: Learning to rank in person re-identification with metric ensembles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1846–1855 (2015)
27. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: Proceedings of the British Machine Vision Conference (2015)
28. Pedagadi, S., Orwell, J., Velastin, S., Boghossian, B.: Local Fisher discriminant analysis for pedestrian re-identification. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3318–3325. IEEE (2013)
29. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
30. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
31. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
32. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. In: Advances in Neural Information Processing Systems, pp. 1473–1480 (2005)
33. Xiong, F., Gou, M., Camps, O., Sznaier, M.: Person re-identification using kernel-based metric learning methods. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part VII. LNCS, vol. 8695, pp. 1–16. Springer, Heidelberg (2014)
34. Yang, Y., Yang, J., Yan, J., Liao, S., Yi, D., Li, S.Z.: Salient color names for person re-identification. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part I. LNCS, vol. 8689, pp. 536–551. Springer, Heidelberg (2014)
35. Yi, D., Lei, Z., Li, S.Z.: Deep metric learning for practical person re-identification (2014). arXiv preprint arXiv:1407.4979
36. Zhang, Z., Saligrama, V.: PRISM: person re-identification via structured matching (2014). arXiv preprint arXiv:1406.4444
37. Zhang, Z., Chen, Y., Saligrama, V.: A novel visual word co-occurrence model for person re-identification. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 122–133. Springer, Heidelberg (2015). doi:10.1007/978-3-319-16199-0_9
38. Zhang, Z., Chen, Y., Saligrama, V.: Group membership prediction. In: 2015 IEEE International Conference on Computer Vision (ICCV). IEEE (2015)
39. Zhao, R., Ouyang, W., Wang, X.: Person re-identification by salience matching. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 2528–2535. IEEE (2013)
40. Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3586–3593. IEEE (2013)
41. Zhao, R., Ouyang, W., Wang, X.: Learning mid-level filters for person re-identification. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 144–151. IEEE (2014)
42. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: a benchmark. In: IEEE International Conference on Computer Vision (2015)
43. Zheng, W.S., Gong, S., Xiang, T.: Person re-identification by probabilistic relative distance comparison. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 649–656. IEEE (2011)
Acknowledgement
This work was supported by the National Key Research and Development Plan (Grant No. 2016YFC0801003), the Chinese National Natural Science Foundation Projects #61473291, #61572501, #61502491, #61572536, NVIDIA GPU donation program and AuthenMetric R&D Funds.