1 Introduction

Given a person of interest as a query, person re-identification (re-ID) aims to determine whether the person has been observed by another camera [9, 16, 17, 20, 25, 26]. It differs fundamentally from classification, which can be considered a closed-set problem [6, 8, 11]. Person re-ID, in contrast, must search for a new person who has to be treated as a new class because that person never appeared in the training dataset. Person re-ID therefore requires good features to represent new identities, which makes it a challenging open-set problem.

Recently, the features learned by convolutional neural networks (CNNs) have been widely used for person re-ID [25]. However, these features are not good enough for person re-ID because CNNs are designed for classifying objects of known classes, not for comparing the similarity of two arbitrary identities. Fig. 1(a) shows the 2D features learned by a CNN with only the softmax loss on the MNIST dataset using LeNets++ [19]: the features fill the whole feature space, and the samples of each class have a uniform and flat distribution. This is inappropriate for person re-ID, because two intra-class features may have a relatively low similarity even if they are classified correctly. More importantly, there is no extra space for new identities. If we want CNNs to learn more discriminative features for new identities, we need to compact the intra-class distribution of the learned features for the existing classes.

Fig. 1. The distributions of the features learned by the CNN models with (a) the softmax loss and (b) the softmax loss together with our proposed pairwise cosine loss, on the MNIST dataset. In (a), \(p_a\) and \(p_c\) are two intra-class features, while \(p_b\) has a different label from them. All of them are classified correctly, but the cosine similarity between \(p_a\) and \(p_c\) is lower than that between \(p_a\) and \(p_b\), which is bad for person re-ID. In (b), the compact intra-class distribution is well suited to cosine similarity comparison.
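The situation described for Fig. 1(a) can be reproduced with a small numeric sketch. The coordinates below are hypothetical values chosen for illustration, not features taken from the experiments: \(p_a\) and \(p_c\) are placed at opposite edges of one angular sector, while \(p_b\) is placed just across the boundary in a neighbouring sector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D points."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def point(angle_deg, radius):
    """2-D point given in polar form (angle in degrees)."""
    t = math.radians(angle_deg)
    return (radius * math.cos(t), radius * math.sin(t))

# Hypothetical layout: the class of p_a and p_c spans the sector 0..36 degrees,
# and p_b belongs to the neighbouring sector 36..72 degrees.
p_a = point(34.0, 5.0)
p_c = point(2.0, 8.0)
p_b = point(40.0, 6.0)

sim_intra = cosine(p_a, p_c)  # same class, 32 degrees apart
sim_inter = cosine(p_a, p_b)  # different classes, 6 degrees apart
print("intra-class similarity:", round(sim_intra, 3))
print("inter-class similarity:", round(sim_inter, 3))
```

With these values the inter-class similarity exceeds the intra-class one, even though a classifier that partitions the plane by angle would label all three points correctly.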

To compact the intra-class distribution, we propose a new pairwise cosine loss that measures the similarity between two intra-class features. Because the features learned by existing CNNs have an angular distribution, as illustrated in Fig. 1(a), it is desirable to use a cosine loss to learn the features and to use the cosine similarity for feature comparison in the evaluation stage. As shown in Fig. 1(b), with our proposed pairwise cosine loss the angular distribution of the features from the known classes is indeed compact, so a lot of room is left for describing new incoming identities. Another contribution of this paper is a novel network based on the Siamese network that takes only positive pairs of images as input and pulls their features as close as possible. This differs from existing methods, which take both positive and negative pairs as input [9, 20].

In this paper, we design a Siamese cosine network embedding (SCNE) to learn discriminative features for person re-ID. Compared to previous networks, we make the learned features not only separable but also compact. Our contributions are:

  • A pairwise cosine loss is proposed to compact the distribution of the intra-class features. It is well suited to the cosine similarity comparison used in person re-ID.

  • We design the SCNE to learn discriminative features under the joint supervision of the softmax loss and the pairwise cosine loss. The network takes only positive pairs as input, without negative pairs, because the inter-class separation can be achieved by the softmax loss in CNNs.

  • Experimental results show that our approach achieves state-of-the-art performance on the public Market1501 and CUHK03 person re-ID benchmarks.

2 Related Works

Our SCNE is inspired by the work of [26], where an identification loss and a verification loss are used for training. The former is the same as the softmax loss, and the latter is a variant of the center loss, in which the added Square layer computes a Euclidean distance for each dimension of the features. In evaluation, however, the similarity is computed by the cosine distance, which is not well matched to a network supervised by the center loss. In contrast, the pairwise cosine loss proposed in this paper is consistent with the similarity comparison, so it can achieve better performance than the work of [26].

Several works address the person re-ID problem with a Siamese network, such as [5, 16, 17, 22]. The work of [17] adopted Long Short-Term Memory (LSTM) to memorize the spatial dependencies of the divided regions in a person image. Its Siamese architecture compares the input image pair with a contrastive loss function, which repels dissimilar inputs and attracts similar inputs. The work of [16] also used a Siamese network to compare features across pairs of images, and adopted a gating function to selectively emphasize the fine common local patterns in a person image. The work of [5] is also very similar to ours, but it used GoogLeNet [14] as the base network. In addition, it proposed a loss-specific dropout unit that gives a pairwise-consistent dropout for the verification subnet. This specially designed network achieved strong performance. All of the above works used both negative and positive input pairs to train the network, whereas ours uses only positive input pairs.

Besides, the work of [22] also used a cosine distance in a Siamese network, but adopted it as a connection function for the cost function. It treated the output of the network as a binary classification problem used only for similarity measurement. Such a network is naturally a verification network, and it has been shown that a verification network without an identification network is not good enough for person re-ID [26]. In this paper, we propose to combine two identification networks through the pairwise cosine loss, which can separate inter-class features and effectively compact intra-class features.

3 Siamese Cosine Network Embedding (SCNE)

3.1 The Proposed Pairwise Cosine Loss

Suppose the input image of the CNN is \({{\mathbf {x}}_{i}}\) and its label is \({{y}_{i}}\). The input \({{\mathbf {f}}_{i}}\) of the last fully connected (FC) layer is commonly used as the feature that represents \({{\mathbf {x}}_{i}}\) for similarity comparison. Suppose the parameters of the last FC layer are \({{\mathbf {W}}^{j}}, j=1,\ldots ,C\), where C is the number of outputs; the jth output is then \(o_{i}^{j}={{({{\mathbf {W}}^{j}})}^{T}}{{\mathbf {f}}_{i}}\). If we want the jth output to be the maximum, we need to maximize the value \(o_{i}^{j}\). For the widely used softmax log-loss, we have

$$\begin{aligned} {{L}_{s}}=-\sum \limits _{i=1}^{N}{\log \frac{\exp (o_{i}^{{{y}_{i}}})}{\sum \nolimits _{k=1}^{C}{\exp (o_{i}^{k})}}} \end{aligned}$$
(1)

where \(o_{i}^{{{y}_{i}}}\) is the output value at the label \({{y}_{i}}\) position, and N is the number of the samples.
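As an illustration, Eq. (1) can be sketched in plain Python. The function names `fc_outputs` and `softmax_log_loss`, as well as the toy weights and features, are hypothetical and serve only to make the formula concrete; a bias-free FC layer is assumed, as in the definition of \(o_{i}^{j}\) above.

```python
import math

def fc_outputs(weights, feature):
    """o^j = (W^j)^T f for every class j; weights is a list of C weight vectors."""
    return [sum(w * x for w, x in zip(w_j, feature)) for w_j in weights]

def softmax_log_loss(weights, features, labels):
    """Eq. (1): sum over the N samples of -log softmax(o_i)[y_i]."""
    total = 0.0
    for f, y in zip(features, labels):
        o = fc_outputs(weights, f)
        m = max(o)  # shift by the maximum for numerical stability
        log_partition = m + math.log(sum(math.exp(v - m) for v in o))
        total += log_partition - o[y]
    return total

# Hypothetical toy setup: C = 2 classes and 2-D features.
W = [[1.0, 0.0], [0.0, 1.0]]
features = [[2.0, 0.0], [0.0, 3.0]]
labels = [0, 1]
loss = softmax_log_loss(W, features, labels)
print("softmax log-loss:", loss)
```

Shifting the outputs by their maximum does not change the value of the loss; it only avoids overflow in the exponentials.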

Clearly, the softmax log-loss only separates the features into different classes without effectively compacting the intra-class features. The problem thus boils down to developing an efficient loss function that compacts the feature distribution of each class. Intuitively, given the angular distribution of the features learned by ImageNet pre-trained CNNs, the model should minimize the cosine loss between the two intra-class features produced by the input pair in the Siamese network, which pulls the intra-class features close to each other. The cosine similarity measurement can then be adopted to achieve better performance for person re-ID.

To this end, we propose the pairwise cosine loss function, as formulated in (2):

$$\begin{aligned} {{L}_{c}}=\sum \limits _{i=1}^{N}{\left( 1-\cos ({\mathbf {f}}_{i}^{a},{\mathbf {f}}_{i}^{b})\right) } \end{aligned}$$
(2)

where \(\cos (\mathbf {f}_{i}^{a},\mathbf {f}_{i}^{b}) =\frac{{{(\mathbf {f}_{i}^{a})}^{T}}{\mathbf {f}}_{i}^{b}}{||{\mathbf {f}}_{i}^{a}|| ||{\mathbf {f}}_{i}^{b}||} ={{\left( \frac{{\mathbf {f}}_{i}^{a}}{||{\mathbf {f}}_{i}^{a}||} \right) }^{T}} \left( \frac{{\mathbf {f}}_{i}^{b}}{||{\mathbf {f}}_{i}^{b}||} \right) \), \({\mathbf {f}}_{i}^{a}\) and \({\mathbf {f}}_{i}^{b}\) are the deep learned features of the input pair, and \(\frac{{\mathbf {f}}_{i}^{a}}{||{\mathbf {f}}_{i}^{a}||}\) and \(\frac{{\mathbf {f}}_{i}^{b}}{||{\mathbf {f}}_{i}^{b}||}\) are the \(l_2\) normalized features. The loss contains a cosine term, which is the cosine of the angle between \({\mathbf {f}}_{i}^{a}\) and \({\mathbf {f}}_{i}^{b}\). It characterizes the intra-class cosine variation only if the two images of a pair have the same label, so the input of our Siamese network must consist of positive pairs only.
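A minimal sketch of Eq. (2) in plain Python is given here for concreteness. The helper names `l2_normalize` and `pairwise_cosine_loss` and the example vectors are hypothetical, and the sketch assumes that no feature vector is all zeros (otherwise the normalization is undefined).

```python
import math

def l2_normalize(f):
    """Return f / ||f||; assumes f is not the zero vector."""
    norm = math.sqrt(sum(x * x for x in f))
    return [x / norm for x in f]

def pairwise_cosine_loss(feats_a, feats_b):
    """Eq. (2): sum over positive pairs of 1 - cos(f_a, f_b)."""
    total = 0.0
    for fa, fb in zip(feats_a, feats_b):
        na, nb = l2_normalize(fa), l2_normalize(fb)
        total += 1.0 - sum(x * y for x, y in zip(na, nb))
    return total

# Hypothetical features: the first pair is collinear, the second is orthogonal.
feats_a = [[2.0, 0.0], [1.0, 0.0]]
feats_b = [[5.0, 0.0], [0.0, 1.0]]
print("pairwise cosine loss:", pairwise_cosine_loss(feats_a, feats_b))
```

Each pair contributes a value between 0 (same direction) and 2 (opposite directions), and the contribution does not depend on the lengths of the two features.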

To learn and update the parameters of our network by back propagation, we need the gradients of \(L_c\) with respect to \({\mathbf {f}}_{i}^{a}\) and \({\mathbf {f}}_{i}^{b}\). They are given as follows:

$$\begin{aligned} \frac{\partial {{L}_{c}}}{\partial {\mathbf {f}}_{i}^{a}}=\frac{1}{||{\mathbf {f}}_{i}^{a}||}\left( \cos ({\mathbf {f}}_{i}^{a},{\mathbf {f}}_{i}^{b})\frac{{\mathbf {f}}_{i}^{a}}{||{\mathbf {f}}_{i}^{a}||}-\frac{{\mathbf {f}}_{i}^{b}}{||{\mathbf {f}}_{i}^{b}||} \right) \end{aligned}$$
(3)
$$\begin{aligned} \frac{\partial {{L}_{c}}}{\partial {\mathbf {f}}_{i}^{b}}=\frac{1}{||{\mathbf {f}}_{i}^{b}||}\left( \cos ({\mathbf {f}}_{i}^{a},{\mathbf {f}}_{i}^{b})\frac{{\mathbf {f}}_{i}^{b}}{||{\mathbf {f}}_{i}^{b}||}-\frac{{\mathbf {f}}_{i}^{a}}{||{\mathbf {f}}_{i}^{a}||} \right) \end{aligned}$$
(4)

In (3) and (4), \(\cos ({\mathbf {f}}_{i}^{a},{\mathbf {f}}_{i}^{b})\), \(\frac{{\mathbf {f}}_{i}^{a}}{||{\mathbf {f}}_{i}^{a}||}\) and \(\frac{{\mathbf {f}}_{i}^{b}}{||{\mathbf {f}}_{i}^{b}||}\) can be pre-computed in the forward pass and re-used in back propagation for efficiency. Note that there are no parameters in our added pairwise cosine loss layer.
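The analytic gradient in (3) can be checked against a central finite difference of the per-pair loss \(1-\cos ({\mathbf {f}}_{i}^{a},{\mathbf {f}}_{i}^{b})\). The sketch below is only a sanity check under assumed inputs: the function names and the two test vectors are hypothetical. The gradient in (4) follows by swapping the roles of the two features.

```python
import math

def norm(f):
    return math.sqrt(sum(x * x for x in f))

def cos_sim(fa, fb):
    return sum(x * y for x, y in zip(fa, fb)) / (norm(fa) * norm(fb))

def grad_wrt_a(fa, fb):
    """Eq. (3) for a single pair: (cos * fa/||fa|| - fb/||fb||) / ||fa||."""
    c, na, nb = cos_sim(fa, fb), norm(fa), norm(fb)
    return [(c * x / na - y / nb) / na for x, y in zip(fa, fb)]

def numeric_grad_wrt_a(fa, fb, eps=1e-6):
    """Central finite difference of 1 - cos(fa, fb) with respect to fa."""
    grad = []
    for k in range(len(fa)):
        plus, minus = list(fa), list(fa)
        plus[k] += eps
        minus[k] -= eps
        diff = (1.0 - cos_sim(plus, fb)) - (1.0 - cos_sim(minus, fb))
        grad.append(diff / (2.0 * eps))
    return grad

# Hypothetical feature pair used only for the check.
fa = [1.0, 2.0, -0.5]
fb = [0.3, -1.0, 2.0]
analytic = grad_wrt_a(fa, fb)
numeric = numeric_grad_wrt_a(fa, fb)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest gap between analytic and numeric gradient:", max_gap)

# Eq. (4) is obtained by exchanging the two arguments.
grad_b = grad_wrt_a(fb, fa)
```

The gradient in (3) is orthogonal to \({\mathbf {f}}_{i}^{a}\), which reflects that the loss depends only on the direction of the feature and not on its length.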

3.2 Joint Optimization

If one wants to compact the intra-class features while keeping them separated, the softmax log-loss in (1) and the pairwise cosine loss in (2) should be combined. The joint objective function of the two losses is given as follows:

$$\begin{aligned} L={{L}_{s}}+\lambda {{L}_{c}} \end{aligned}$$
(5)

where the parameter \(\lambda \) is used for balancing the two losses. The softmax log-loss can be considered as a special case when \(\lambda =0\).

If we use only the softmax log-loss as supervision, the learned features contain large intra-class variations. On the other hand, if we supervise the CNN with only the pairwise cosine loss, the learned features degenerate to zeros or lines (at which point the cosine loss is very small). Neither loss alone achieves discriminative feature learning, so it is necessary to combine them.
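The degenerate case can be made concrete with a toy sketch of the joint objective (5). Everything below is a hypothetical illustration: the feature values, the FC weights and the function names are assumptions, and for brevity the softmax loss is evaluated on one stream only. When all features collapse onto the same vector, the cosine loss vanishes, but the softmax log-loss cannot drop below \(N\log C\) for class-balanced labels, because every sample produces the same output.

```python
import math

def cosine_loss(pairs):
    """Eq. (2) over a list of (f_a, f_b) positive pairs."""
    total = 0.0
    for fa, fb in pairs:
        dot = sum(x * y for x, y in zip(fa, fb))
        na = math.sqrt(sum(x * x for x in fa))
        nb = math.sqrt(sum(x * x for x in fb))
        total += 1.0 - dot / (na * nb)
    return total

def softmax_loss(weights, features, labels):
    """Eq. (1) with a bias-free FC layer."""
    total = 0.0
    for f, y in zip(features, labels):
        o = [sum(w * x for w, x in zip(w_j, f)) for w_j in weights]
        m = max(o)
        total += m + math.log(sum(math.exp(v - m) for v in o)) - o[y]
    return total

def joint_loss(l_s, l_c, lam):
    """Eq. (5): L = L_s + lambda * L_c."""
    return l_s + lam * l_c

# Collapsed features: both identities map to the same vector.
f = [1.0, 1.0]
pairs = [(f, f), (f, f)]       # one positive pair per identity
labels = [0, 1]                # identity of each pair
W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical FC weights

l_c = cosine_loss(pairs)
l_s = softmax_loss(W, [fa for fa, _ in pairs], labels)
print("L_c:", l_c, "L_s:", l_s, "L:", joint_loss(l_s, l_c, lam=1.0))
```

In this collapsed configuration the cosine term is satisfied trivially, and only the softmax term of the joint objective penalizes the loss of separability.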

3.3 Architecture of the Designed SCNE

Our network is based on the Siamese network. Figure 2 briefly illustrates the architecture of the proposed network, where the parameter-shared layers can be replaced by ImageNet pre-trained CNN layers. The network consists of two parameter-shared CNN streams, two modified FC layers and three losses. The features extracted by the network are used as the descriptors, which are directly supervised by two softmax losses and one pairwise cosine loss. The softmax losses are used for class prediction, and the pairwise cosine loss is used for compacting the intra-class variation. The high-level features \({\mathbf {f}}_{i}^{a}\) and \({\mathbf {f}}_{i}^{b}\) are merged in our added pairwise cosine loss layer, which has no parameters. The ImageNet pre-trained CNN model can be AlexNet [8], VGGNet [11] or ResNet [6]. In this paper, we take Res50Net as the baseline for comparison with the state of the art.

Fig. 2. The architecture of our proposed SCNE.

In order to fine-tune the network on different person re-ID datasets, we replace the final FC layer of the pre-trained Res50Net model with a \(1\times 1\times 2048\times n\) dimensional FC layer, where n is the number of identities in the training dataset. Given an input pair of intra-class images resized to \(224\times 224\), the network predicts the identities of the two images and computes the pairwise cosine loss for them. The pairwise cosine loss layer is coupled with the last FC layer and affects the distribution of the learned features.

4 Experiments

4.1 Datasets and Preparation

The proposed model is tested on two large-scale person re-ID benchmarks, Market1501 [24] and CUHK03 [9].

The Market1501 dataset has 32668 images of 1501 identities. According to the dataset setting, 12936 images of 751 identities are used for training, and 19732 images of 750 identities plus distractors are used for testing. The images are cropped automatically by the deformable part model (DPM) [4] detector and are thus close to a realistic setting. The evaluation follows the baseline setting of the dataset.

The CUHK03 dataset consists of 13164 cropped images of 1467 identities collected on the CUHK campus. The bounding boxes detected by the DPM detector are closer to a realistic setting and are used in our experiments. Following the given setting, the dataset is partitioned into a training set of 1367 identities and a testing set of 100 identities. The experiments are repeated with 20 random splits. In evaluation, we randomly select 100 images from 100 identities under another camera as the gallery.

The training images are uniformly resized to \(256\times 256\), and the mean image computed from all the training images is subtracted. To match the input of the Res50Net network, we crop the images to \(224\times 224\). The training images are randomly mirrored horizontally. We draw each batch of training images at random and, for every image, sample online another image with the same label to compose an intra-class input pair.
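The online sampling of intra-class pairs can be sketched as follows. This is an illustrative sketch, not the original MatConvNet code: the function names are hypothetical, and two details are assumptions that the text does not specify, namely that the partner image differs from the anchor image and that the \(224\times 224\) crop is taken at a random position.

```python
import random
from collections import defaultdict

def build_label_index(labels):
    """Map each identity label to the indices of its images."""
    index = defaultdict(list)
    for i, y in enumerate(labels):
        index[y].append(i)
    return index

def sample_positive_pairs(labels, batch_size, rng):
    """Draw anchors at random and pair each with another image of the same identity."""
    index = build_label_index(labels)
    # Assumption: only identities with at least two images can act as anchors.
    candidates = [i for i, y in enumerate(labels) if len(index[y]) > 1]
    anchors = rng.sample(candidates, batch_size)
    pairs = []
    for a in anchors:
        partner = rng.choice([j for j in index[labels[a]] if j != a])
        pairs.append((a, partner))
    return pairs

def crop_and_flip_params(rng, full=256, crop=224):
    """Top-left corner of a random crop and a horizontal-mirror flag (assumed scheme)."""
    top = rng.randint(0, full - crop)
    left = rng.randint(0, full - crop)
    flip = rng.random() < 0.5
    return top, left, flip

# Hypothetical labels for six training images.
labels = [0, 0, 1, 1, 1, 2]
rng = random.Random(0)
for a, b in sample_positive_pairs(labels, batch_size=3, rng=rng):
    print("pair", (a, b), "identity", labels[a], "augmentation", crop_and_flip_params(rng))
```

Because every pair is built from a single identity, no negative pairs ever reach the pairwise cosine loss.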

4.2 Implementation Setting

The MatConvNet package [18] is used for training and testing. The network is trained for 30 epochs. We adopt mini-batch stochastic gradient descent to update the parameters of our network, with a batch size of 64 pairs. The learning rate is initialized to 0.01, set to 0.001 after 15 epochs, and set to 0.0001 for the final 5 epochs. There are three objectives in our network. The gradients produced by the individual objectives are added together with different weights: we assign 0.5 to each of the two gradients produced by the two softmax log-losses and 1 to the gradient produced by the pairwise cosine loss.
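One plausible reading of this schedule and of the gradient weighting is sketched below. The function names are hypothetical, epochs are assumed to be numbered from 1, and the exact epoch boundaries (0.01 for epochs 1 to 15, 0.001 for epochs 16 to 25, 0.0001 for epochs 26 to 30) are an interpretation of the description rather than a confirmed setting.

```python
def learning_rate(epoch, total_epochs=30):
    """Step schedule for a 1-based epoch index (assumed boundaries)."""
    if epoch <= 15:
        return 0.01
    if epoch <= total_epochs - 5:
        return 0.001
    return 0.0001

def combine_gradients(grad_softmax_a, grad_softmax_b, grad_cosine,
                      w_softmax=0.5, w_cosine=1.0):
    """Weighted sum of the three gradients with respect to the shared parameters."""
    return [w_softmax * ga + w_softmax * gb + w_cosine * gc
            for ga, gb, gc in zip(grad_softmax_a, grad_softmax_b, grad_cosine)]

for epoch in (1, 15, 16, 25, 26, 30):
    print("epoch", epoch, "learning rate", learning_rate(epoch))

# Hypothetical one-dimensional gradients from the three objectives.
print(combine_gradients([2.0], [4.0], [1.0]))
```

With weights of 0.5 for each softmax gradient, the two identification objectives together contribute as much as a single softmax loss with unit weight.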

For testing, we extract features by activating only one stream of our fine-tuned model and taking the output before the FC layer. Given an input image of size \(224\times 224\), we feed the image forward through the network and take the output of the ‘pool5’ layer of Res50Net as the descriptor. Once the descriptors for the query and gallery sets are obtained, we sort the cosine distances between the two sets to get the final result. The mean average precision (mAP) and rank-1 accuracy are used for evaluation.
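The ranking step and the two metrics can be sketched for a single query as follows. The descriptors, labels and function names are hypothetical, and dataset-specific protocol details (for example, filtering gallery images by camera) are omitted. The mAP is the mean of the average precision over all queries, and the rank-1 accuracy is the fraction of queries whose top-ranked gallery image has the correct identity.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_gallery(query, gallery):
    """Gallery indices sorted by decreasing cosine similarity to the query."""
    sims = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: -sims[i])

def rank1_hit(ranking, gallery_labels, query_label):
    """True if the top-ranked gallery item has the query identity."""
    return gallery_labels[ranking[0]] == query_label

def average_precision(ranking, gallery_labels, query_label):
    """Mean of the precision values at the ranks of the correct matches."""
    hits, precision_sum = 0, 0.0
    for pos, idx in enumerate(ranking, start=1):
        if gallery_labels[idx] == query_label:
            hits += 1
            precision_sum += hits / pos
    return precision_sum / hits if hits else 0.0

# Hypothetical 2-D descriptors standing in for the pooled network features.
query, query_label = [1.0, 0.0], 3
gallery = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0], [-1.0, 0.0]]
gallery_labels = [3, 3, 7, 9]

ranking = rank_gallery(query, gallery)
print("ranking:", ranking)
print("rank-1 hit:", rank1_hit(ranking, gallery_labels, query_label))
print("average precision:", average_precision(ranking, gallery_labels, query_label))
```

Sorting by decreasing cosine similarity is equivalent to sorting by increasing cosine distance.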

4.3 Results on Market1501

On the Market1501 dataset, we compare our results with state-of-the-art algorithms, among which PersonNet [20], Verification-Classification [26], DeepTransfer [5], Gated Reid [16] and S-LSTM [17] are all based on the Siamese network and have achieved state-of-the-art performance. SOMAnet [2] uses synthetic data to train an Inception network, while GAN ResNet [27] uses generative adversarial networks (GANs) to generate unlabeled samples for learning better models. Both can be regarded as variants of data augmentation.

Table 1. Comparison with the state-of-the-art methods on Market1501.

The single query (SQ) and multiple query (MQ) results are reported in Table 1. Our SCNE achieves 83.25% rank-1 accuracy and 63.50% mAP under the single query mode, and 88.42% rank-1 accuracy and 71.27% mAP under the multiple query mode, which ranks second among all the above results. It greatly outperforms the Gated Reid [16] and S-LSTM [17] methods, which used the Siamese network without combining a classification loss and a verification loss. Our method also outperforms Verification-Classification [26], which used a Euclidean loss for verification; such a loss is not well suited to comparison by cosine similarity. The best method is DeepTransfer [5], which adopted a specially designed dropout strategy to combine the classification loss and the verification loss on top of the GoogLeNet base network.

4.4 Results on CUHK03

On the CUHK03 dataset, there are two types of evaluations, single shot (SS) and multiple shots (MS).

In the single shot setting, we compare with ImprovedDeep [1], PersonNet [20], Verification-Classification [26], Pose Invariant [23], DNN-IM [13], SOMAnet [2], GAN ResNet [27], CNN-FRW-IC [7], DeepTransfer [5] and the ResNet baseline [25]. We randomly select 100 images from 100 identities under another camera as the gallery and report the mAP and rank-1 accuracy in Table 2. We achieve a rank-1 accuracy of 85.1% and an mAP of 83.3%, which is an excellent result compared with the above methods.

Table 2. Comparison with the state-of-the-art methods on CUHK03.

In the multiple shot setting, all the images from another camera are used as the gallery, and the number of candidate images is about 500. This evaluation is much closer to image retrieval and alleviates the instability caused by random gallery selection. We compare with S-LSTM [17], Gated Reid [16], Verification-Classification [26], SOMAnet [2] and GAN ResNet [27] in terms of mAP and rank-1 accuracy. Our SCNE achieves a rank-1 accuracy of 82.0% and an mAP of 88.1%, which is also very competitive.

Fig. 3. The performance of our SCNE for different values of the parameter \(\lambda \).

Fig. 4. The performance of our SCNE over the training iterations.

4.5 Parameter Sensitivity Analysis

Because the parameter \(\lambda \) controls the balance between the pairwise cosine loss and the softmax loss, it is essential to our SCNE. We therefore conduct experiments on the Market1501 dataset to investigate the influence of \(\lambda \). The results, reported in Fig. 3, show that a proper \(\lambda \) achieves the best mAP and rank-1 accuracy. Good performance is achieved when \(\lambda =1\).

We also report in Fig. 4 how the performance of our SCNE changes as training proceeds. The performance rises only slowly after 20 epochs.

5 Conclusion

In this paper, we propose a pairwise cosine loss to compact the distribution of the intra-class features and design the SCNE to learn discriminative features for person re-ID. Our SCNE is trained under the joint supervision of the softmax loss and the pairwise cosine loss. Compared to previous networks, we make the learned features not only separable but also compact. Experimental results show that our approach achieves state-of-the-art performance on the public Market1501 and CUHK03 person re-ID benchmarks. Since our SCNE is well suited to similarity comparison, we will apply it to identity retrieval in the future.