Abstract
Conventional deep semi-supervised learning methods, such as the recursive clustering-and-training process, suffer from cumulative error and high computational complexity when collaborating with Convolutional Neural Networks. To this end, we design a simple but effective learning mechanism that merely substitutes the last fully-connected layer with the proposed Transductive Centroid Projection (TCP) module. It is inspired by the observation that the weights of the final classification layer (called anchors) converge to the central direction of each class in hyperspace. Specifically, we design the TCP module to dynamically add an ad hoc anchor for each cluster in one mini-batch. This essentially reduces the probability of inter-class conflict and enables the unlabelled data to function as labelled data. We inspect its effectiveness with an elaborate ablation study on seven public face/person classification benchmarks. Without any bells and whistles, TCP achieves significant performance gains over most state-of-the-art methods in both fully-supervised and semi-supervised settings.
1 Introduction
The explosion of Convolutional Neural Networks (CNNs) has brought a remarkable evolution to the field of image understanding, especially in real-world tasks such as face recognition [1,2,3,4,5] and person re-identification (Re-ID) [6,7,8,9,10,11]. Much of this progress was sparked by the creation of large-scale datasets as well as new and robust learning strategies for feature learning. For instance, MS-Celeb-1M [12] and MARS [13] provide more than 10 million face images and 1 million pedestrian images respectively with rough annotation. Moreover, in an industrial environment, it may take only a few weeks to collect billions of face/pedestrian gallery images from a city-level surveillance system, but it is hard to label such billion-level data. Utilizing such large-scale unlabelled data to benefit classification tasks remains non-trivial.
Most recent unsupervised or semi-supervised learning approaches for face recognition or Re-ID [14,15,16,17,18,19,20] are based on self-training, i.e. the model clusters the training data and the clustered results are then used to fine-tune the model iteratively until convergence, as shown in Fig. 1(a). The typical downsides of this process lie in two aspects. First, the recursive training framework is time-consuming. Second, the clustering algorithms used in such approaches tend to generate ID-clusters with high precision but somewhat low recall, which guarantees clean clusters without inner errors but may cause inter-class conflict, i.e. instances belonging to one identity are divided into different clusters, which hampers the fine-tuning stage. To this end, a question arises: how can we utilize unlabelled data in a stable training process, such as a CNN model with a softmax classification loss, without any recursion and while avoiding inter-class conflict?
In this study, we design a novel Transductive Centroid Projection layer that efficiently incorporates the training of unlabelled clusters into the learning of labelled samples, and that can be readily extended to a fully unsupervised manner by setting the labelled data to \(\varnothing \).
It is inspired by the latent space learned by the commonly used softmax loss. In a deep neural network, each column of the projection matrix \(\mathbf {W}\) of the final fully-connected layer indicates the normal direction of a decision hyperplane; we call each such column an anchor in this paper. For a labelled sample, the anchor of its class already exists in \(\mathbf {W}\), so we can train the network by maximizing the inner product of its feature and its anchor. An unlabelled sample, however, has no class label, so it cannot directly provide a decision hyperplane. To utilize unlabelled samples with a conventional deep classification network, we need a way to simulate their anchors.
Motivated by the observation that the anchor approximates the centroid direction, as shown in Fig. 2, the transductive centroid projection layer dynamically estimates the class centroids of the unlabelled clusters in each mini-batch and treats them as new anchors for the unlabelled data, which are then absorbed into the projection matrix so as to enable classification of both labelled and unlabelled data. As visualized in Fig. 1(b), the projection matrix \(\mathbf {W}\) of the classification layer in the original CNN is replaced by the joint matrix of \(\mathbf {W}\) and the ad hoc centroids \(\mathbf {C}\). In this manner, labelled and unlabelled data function identically during training. As analyzed in Sect. 3.3, since the number of ad hoc centroids in each mini-batch is much smaller than the total number of clusters, the inter-class conflict ratio is naturally low and can hardly influence the training process.
Comprehensive evaluations are conducted in this paper against popular semi-supervised methods and metric learning loss functions. The proposed transductive centroid projection shows superior performance in stabilizing unsupervised/semi-supervised training and in optimizing the learned feature representation.
To sum up, the contribution of this paper is threefold:
(1) Observation interpretation - We investigate, both theoretically and empirically, the observation that the direction of each anchor (i.e. weight \(\mathbf {w}_n\)) gradually coincides with that of its class centroid as the model converges.
(2) A novel Transductive Centroid Projection layer - Based on the observation above, we propose an innovative un/semi-supervised learning mechanism that wisely integrates unlabelled data into the recognition task to boost its discriminative ability, by introducing a new layer named Transductive Centroid Projection (TCP). Without any iterative processing such as self-training or label propagation, the proposed TCP can be simply trained and steadily embedded into an arbitrary CNN structure with any classification loss.
(3) Superior performance on face recognition and Re-ID benchmarks - We apply TCP to face recognition and person re-identification, and conduct extensive evaluations to thoroughly examine its superiority to both semi-supervised and supervised learning approaches.
1.1 Related Works
Semi-supervised Learning. An effective approach to deep semi-supervised learning is label propagation with self-training [21], trusting either the labels predicted by a model trained on labelled data (for the closed-set case) or those produced by a clustering model [22,23,24,25] (for the open-set case). This hampers model convergence if the confidence threshold is not precisely set. Other methods such as generative models [26], semi-supervised Support Vector Machines [27] and graph-based semi-supervised learning [28] rest on clear mathematical frameworks but are hard to incorporate into deep learning methods.
Semi-supervised Face/Person Recognition. In [16], coupled dictionaries are jointly learned from both labelled and unlabelled data. LSRO [8] adopts a GAN [29] to generate person patches that normalize the data distribution, together with a loss to supervise the generated patches. Some works [18, 19] adopt local metric loss functions (e.g. triplet loss [2]) to avoid inter-class conflict; such local optimization objectives, however, are usually unstable and hard to converge, especially on large-scale data. Other methods [19] adopt the softmax loss to optimize global classes and thus suffer from inter-class conflict. Most of these methods focus on transfer learning, self-training or data distribution normalization. In this work we address a more basic question, namely how to wisely train a simple CNN model that fully leverages both labelled and unlabelled data, without self-training or transfer learning.
2 Observation Inside the Softmax Classifier
In a typical feed-forward CNN, let \(\mathbf {f} \in \mathbb {R}^D\) denote the feature vector of one sample generated by the prior layers, where D is the feature dimension. The linear activation \(\mathbf {y}\in \mathbb {R}^N\) referring to N class labels is therefore produced with the weight \(\mathbf {W} \in \mathbb {R}^{D\times N}\) and bias \(\mathbf {b} \in \mathbb {R}^{N}\) as \(\mathbf {y} = \mathbf {W}^\top \mathbf {f} + \mathbf {b}\).
In this work we degenerate this classifier layer from an affine to a linear projection by setting the bias term \(\mathbf {b} \equiv \mathbf {0}\). Supervised by the softmax loss and optimized by SGD, we can usually observe the following phenomenon: the anchor \(\mathbf {w}_i = \mathbf {W}_{[i]} \in \mathbb {R}^{D}\) for class i points in the direction of the data centroid of class i once the model has converged. We first show this observation in three toy examples, from a low-dimensional space to a high-dimensional one, and then interpret it from the gradient point of view.
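To make the setup concrete, here is a minimal PyTorch sketch (illustrative only, not from the paper) of such a bias-free classification layer, in which each row of the stored weight matrix is one anchor:

```python
import torch
import torch.nn as nn

# Bias-free final classification layer: logits y = W^T f with b = 0.
# nn.Linear stores the weight as an (N, D) matrix, so row i is the anchor w_i.
D, N = 128, 10                      # feature dimension, number of classes
fc = nn.Linear(D, N, bias=False)    # degenerate affine -> linear projection

f = torch.randn(32, D)              # a mini-batch of features from prior layers
y = fc(f)                           # logits of shape (32, N); y[:, i] = f . w_i

w_3 = fc.weight[3]                  # the anchor of class 3, a vector in R^D
```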
2.1 Toy Examples
To investigate the aforementioned observation from small-scale to large-scale tasks and from low-dimensional to high-dimensional latent spaces, we empirically analyze three tasks with different data scales, feature dimensions and network structures: character classification on MNIST [33] with 10 classes, object classification on CIFAR-100 [34] with 100 classes, and face recognition on MS1M [35] with 100,000 classes (see Footnote 1). Table 1 records the detailed settings of these experiments. In each task, two FC layers follow the backbone structure, where FC1 learns an internal feature vector \(\mathbf {f}\) and FC2 acts as the projection onto the class space. All tasks employ the softmax loss. Figure 2 depicts the feature spaces extracted from the different datasets, in which the 2-D features of MNIST are plotted directly and the 128-D features of CIFAR-100 and MS1M are compressed by Barnes-Hut t-SNE [36].
Fig. 2. Visualization of feature spaces on different tasks, i.e. (a) MNIST, (b) CIFAR-100 and (c) MS1M, where the features of CIFAR-100 and MS1M are visualized by Barnes-Hut t-SNE [36]; (d) depicts the evolution of the cosine distance between anchor direction and class centroid with respect to the training iteration on MNIST.
MNIST – Figure 2(a) shows the feature visualization at three stages: 0, 2 and 10 epochs. We set the feature dimension to \(D=2\) for \(\mathbf {f}\) so as to explore the distribution in the low-dimensional case. Training this model progressively increases the congregation of features within each class and the discrepancy between classes. We pick four classes and show their directions \(\mathbf {W}_{[n]}\) from the projection matrix \(\mathbf {W}\), i.e. their anchors. All anchors have random directions at the start of training, and they gradually move towards the directions of their respective centroids.
CIFAR-100 and MS1M – To examine this observation at a much larger data scale and in a higher-dimensional case, we further use CIFAR-100 and MS1M for an ample demonstration. Different from MNIST, the feature dimension of \(\mathbf {f}\) is \(D=128\), and t-SNE is used for dimensionality reduction without losing the cosine metric. Similar to the phenomenon observed on MNIST, features within each class tend to be progressively clustered together while features from different classes exhibit increasingly distinct margins between one another. Meanwhile the anchors, marked by red dots, are located near their corresponding class centroids. The anchors of a well-trained MS1M model likewise co-locate with the class centroids.
In addition, for a quantified assessment, we compute the cosine similarity \(\mathcal {C}(\mathbf {w}_n, \mathbf {c}_n)\) between the anchor \(\mathbf {w}_n = \mathbf {W}_{[n]}\) and the class centroid \(\mathbf {c}_n\) for each of the 10 classes on MNIST. Figure 2(d) plots \(\mathcal {C}(\mathbf {w}_n, \mathbf {c}_n)\) against the training iterations. Almost all classes converge to a similarity of 1 within one epoch, i.e. the direction of the anchor shifts to the direction of the class centroid.
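The measurement behind Fig. 2(d) can be reproduced with a few lines; the sketch below (our own, assuming access to the final fc weights and the per-class training features) computes \(\mathcal {C}(\mathbf {w}_n, \mathbf {c}_n)\) for every class:

```python
import torch
import torch.nn.functional as F

def anchor_centroid_cosine(weight: torch.Tensor,
                           feats: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    """Cosine similarity C(w_n, c_n) between each anchor (row n of the
    final fc weight, shape (N, D)) and the centroid of class n's features."""
    n_classes = weight.size(0)
    sims = torch.empty(n_classes)
    for n in range(n_classes):
        c_n = feats[labels == n].mean(dim=0)              # class centroid c_n
        sims[n] = F.cosine_similarity(weight[n], c_n, dim=0)
    return sims   # approaches 1 for every class as training converges
```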
To conclude, the anchor direction \(\mathbf {W}_{[n]}\) is consistently aligned with the direction of the corresponding class centroid across different dataset scales and different feature dimensions of \(\mathbf {f}\).
2.2 Investigating the Gradients
We investigate why the directions of the anchor and the centroid gradually become consistent, from the perspective of gradient descent during training. Considering the input of the linear projection \(\mathbf {f}\), which belongs to the n-th class, and the output \(\mathbf {y} = \mathbf {W}^\top \mathbf {f}\), the softmax probability that \(\mathbf {f}\) belongs to the n-th class is \(p_n = \frac{e^{\mathbf {y}_n}}{\sum _{j=1}^{N} e^{\mathbf {y}_j}}.\)
We want to minimize the negative log-likelihood, i.e. the softmax loss \(\ell (\theta ) = -\log p_n\),
where \(\theta \) denotes the set of all parameters in the CNN. We can now derive the gradient of the softmax loss \(\ell _\mathbf {f}\) with respect to the anchor \(\mathbf {w}_n\) given a single sample \(\mathbf {f}\): \(\nabla _{\mathbf {w}_n}\ell _\mathbf {f} = (p_n - \mathbb {I}[\mathbf {f} \in \mathcal {I}_n])\,\mathbf {f},\)
in which the set of samples of class n is denoted as \(\mathcal {I}_n\), and \(\mathbf {y}_n\) is the \(n^\text {th}\) element of \(\mathbf {y}\). The indicator \(\mathbb {I}\) equals 1 when \(\mathbf {f}\) is in \(\mathcal {I}_n\), and 0 otherwise.
Now considering the samples in one mini-batch, the gradient \(\nabla _{\mathbf {w}_n}\ell \) with respect to \(\mathbf {w}_n\) amounts to a summation over the feature samples of class n together with a negative contribution from the feature samples of the remaining classes: \(\nabla _{\mathbf {w}_n}\ell = -\sum _{\mathbf {f}_i \in \mathcal {I}_n} (1 - p_{n,i})\,\mathbf {f}_i + \sum _{\mathbf {f}_j \notin \mathcal {I}_n} p_{n,j}\,\mathbf {f}_j.\)
In each iteration, the update of \(\mathbf {w}_n\) is therefore \(\varDelta \mathbf {w}_n = -\eta \nabla _{\mathbf {w}_n}\ell = \eta \sum _{\mathbf {f}_i \in \mathcal {I}_n} (1 - p_{n,i})\,\mathbf {f}_i - \eta \sum _{\mathbf {f}_j \notin \mathcal {I}_n} p_{n,j}\,\mathbf {f}_j,\)
where \(\eta \) denotes the learning rate. The former term is a scaled summation of the data samples in class n and is thus approximately proportional to the class centroid \(\mathbf {c}_n\). Since the feature samples are usually evenly distributed in the feature space, the summation over the negative feature samples for class n also approximately follows the negative direction of the centroid \(\mathbf {c}_n\). Therefore the gradient \(\nabla _{\mathbf {w}_n}\ell \) approximately points along the centroid direction \(\mathbf {c}_n\) at each time step, and with sufficient accumulation of gradients the anchor \(\mathbf {w}_n\) eventually follows the direction of the centroid. Figure 3 depicts the moving direction of the anchor \(\mathbf {w}_n\) under the gradient \(\varDelta \mathbf {w}_n = -\nabla _{\mathbf {w}_n}\ell \) and the direction of the samples \(\mathbf {x}_n\) under the gradient \(\varDelta \mathbf {x}_n = -\nabla _{\mathbf {x}_n}\ell \), marked with red dotted lines. For a class n, the samples and the anchor are marked with yellow dots and an arrowed line, respectively. During back-propagation, the direction of \(\mathbf {w}_n\) is updated towards the class centroid \(\mathbf {c}_n\) in the tangential direction, while the samples \(\mathbf {x}_n\in \mathcal {I}_n\) are also gradually pulled towards the direction of \(\mathbf {w}_n\), leading to \(\frac{1}{o}\sum _{j=1}^{o} \mathbf {x}_{nj} = \mathbf {c}_n \rightarrow \mathbf {w}_n\).
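As a quick sanity check (ours, not part of the paper), autograd confirms the per-anchor gradient derived above: for the summed softmax loss, the gradient with respect to \(\mathbf {W}\) is \(\mathbf {f}^\top (\mathbf {p} - \text {onehot})\), whose n-th column is exactly \(\sum _i (p_{n,i} - \mathbb {I}[\mathbf {f}_i \in \mathcal {I}_n])\,\mathbf {f}_i\):

```python
import torch

# Verify numerically that grad(w_n) = sum_i (p_{n,i} - 1[label_i == n]) f_i.
torch.manual_seed(0)
B, D, N = 16, 8, 5
f = torch.randn(B, D)
labels = torch.randint(0, N, (B,))
W = torch.randn(D, N, requires_grad=True)

y = f @ W                                   # logits, shape (B, N)
loss = torch.nn.functional.cross_entropy(y, labels, reduction='sum')
loss.backward()

p = torch.softmax(y, dim=1).detach()        # class probabilities
onehot = torch.nn.functional.one_hot(labels, N).float()
manual_grad = f.t() @ (p - onehot)          # (D, N): column n is grad for w_n
print(torch.allclose(W.grad, manual_grad, atol=1e-5))   # True
```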
3 Approach
Inspired by the observation stated in the previous section, we propose a novel learning mechanism that wisely congregates unlabelled data into the recognition system to enhance its discriminative ability. Let \(\mathcal {X}^{\text {L}}\) denote the labelled dataset with M classes and \(\mathcal {X}^{\text {U}}\) the unlabelled dataset. We first cluster \(\mathcal {X}^{\text {U}}\) by [24] and obtain N clusters. According to the property \(\mathbf {w}_n \approx \mathbf {c}_n\) discussed in the previous section, the ad hoc centroid \(\mathbf {c}^\text {U}\) of an unlabelled cluster can be used to build the corresponding anchor vector \(\mathbf {w}^\text {U}\), which means the ad hoc centroid can serve for a faithful classification of the unlabelled cluster.
3.1 Transductive Centroid Projection (TCP)
In one training step, we construct the mini-batch \(\mathcal {B} = \{ \mathcal {X}^\text {L}_p, \mathcal {X}_q^\text {U} \}\) from labelled data \(\mathcal {X}_p^\text {L} \subset \mathcal {X}^\text {L}\) and unlabelled data \(\mathcal {X}_q^\text {U} \subset \mathcal {X}^\text {U}\), where \(p = \text {card}(\mathcal {X}_p^\text {L})\) and \(q = \text {card}(\mathcal {X}_q^\text {U})\) denote the numbers of selected labelled and unlabelled samples in this batch, respectively. We randomly select \(\mathcal {X}_p^\text {L}\) from the labelled dataset as usual, but the unlabelled part is constructed by randomly selecting l unlabelled clusters with o samples in each cluster, i.e. \(q = l\times o\). Note that the selected l clusters change dynamically from mini-batch to mini-batch. The mini-batch \(\mathcal {B}\) is then fed into the network, and the features extracted before the TCP layer are written as \(\mathbf {f} = [ \mathbf {f}^\text {L}, \mathbf {f}^\text {U} ]^\top \in \mathbb {R}^{(p+q)\times D}\), where D is the feature dimension and \(\mathbf {f}^\text {L}, \mathbf {f}^\text {U}\) denote the feature vectors of labelled and unlabelled data, respectively.
The projection matrix of the TCP layer is reformulated as \(\mathbf {W} = [\mathbf {W}^M, \mathbf {W}^l] \in \mathbb {R}^{D\times (M+l)}\), in which the first M columns are reserved for the anchors of the labeled classes and the remaining l columns are substituted by the ad hoc centroid vectors \(\{ \mathbf {c}^\text {U}_\iota \}_{\iota =1}^l\) of the selected unlabeled clusters. Note that \(\mathbf {c}^\text {U}_\iota \) is calculated from the selected samples \(\{ \mathbf {f}^\text {U}_{\iota , i} \}_{i=1}^o\) of cluster \(\iota \) in this mini-batch as \(\mathbf {c}^\text {U}_\iota = \alpha \, \frac{\sum _{i=1}^{o} \mathbf {f}^\text {U}_{\iota ,i}}{\Vert \sum _{i=1}^{o} \mathbf {f}^\text {U}_{\iota ,i} \Vert _2}.\)
The scale factor \(\alpha \) is the average magnitude of the centroids of the labeled classes. The output of the TCP layer is then obtained as \(\mathbf {y} = \mathbf {W}^\top \mathbf {f}\) without a bias term, and fed into the softmax loss layer.
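The following PyTorch sketch shows one possible implementation of the TCP forward pass under our reading of the above; the class name, the use of labelled-anchor norms as a stand-in for \(\alpha \), and the detaching of the ad hoc anchors are our assumptions, not details specified by the paper:

```python
import torch
import torch.nn as nn

class TransductiveCentroidProjection(nn.Module):
    """Sketch of a TCP layer. The first M columns of the projection are
    learned anchors for the labelled classes; l ad hoc anchors for the
    unlabelled clusters of the current mini-batch are built from their
    feature centroids and concatenated on the fly."""

    def __init__(self, feat_dim: int, num_labeled: int):
        super().__init__()
        # W^M in R^(D x M): anchors of the M labelled classes.
        self.W = nn.Parameter(0.01 * torch.randn(feat_dim, num_labeled))

    def forward(self, f_labeled, f_unlabeled, l: int, o: int):
        # f_unlabeled: (l * o, D), the o samples of each cluster contiguous.
        sums = f_unlabeled.view(l, o, -1).sum(dim=1)          # (l, D)
        # Scale factor alpha: we approximate the average magnitude of the
        # labelled centroids by the average anchor norm (w_n ~ c_n, Sect. 2).
        alpha = self.W.norm(dim=0).mean()
        centroids = alpha * sums / sums.norm(dim=1, keepdim=True)
        # Ad hoc anchors are recomputed per batch; we detach them so only
        # the feature branch receives gradient (an assumption on our part).
        W_full = torch.cat([self.W, centroids.t().detach()], dim=1)  # (D, M+l)
        f = torch.cat([f_labeled, f_unlabeled], dim=0)        # (p + q, D)
        return f @ W_full                                     # (p+q, M+l) logits
```

In use, an unlabelled sample from cluster \(\iota \) would simply take \(M+\iota \) as its softmax target, so that labelled and unlabelled data are trained with the same classification loss.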
Compared to training in a purely unsupervised manner, the semi-supervised learning procedure in this paper (shown in Fig. 4(a)) applies the proposed transductive centroid projection layer, which not only optimizes inference on the labeled data but also indirectly gains recognition ability for the unlabeled clusters. It can also be easily transferred to the unsupervised learning paradigm by setting \(M = 0\), as shown in Fig. 4(b), or to the supervised learning framework when there is no unlabeled data, i.e. \(l=0\).
3.2 Scale Factor \(\alpha \) Matters
As stated in Sect. 3.1, the scale factor \(\alpha \) normalizes the ad hoc centroids of the unlabeled data. For training stability and fast convergence, a suitable scaling criterion is to let the mapped activations \(\mathbf {y}^\text {U}\) of the unlabeled data have a scale similar to that of the labeled ones \(\mathbf {y}^\text {L}\). Indeed, the \(\ell _2\) norm of each labeled centroid inherently offers a reasonable prior scale for mapping the input features \(\mathbf {f}^\text {L}\) to the output activations \(\mathbf {y}^\text {L}\). Therefore, scaling the ad hoc centroids of the unlabeled data by the average scale \(\alpha = \frac{1}{M}\sum _{j=1}^M \Vert \mathbf {c}_j^\text {L} \Vert _2\) of the labeled centroids gives the unlabeled activations a distribution similar to that of the labeled activations, ensuring stability and fast convergence during training.
3.3 Avoid Inter-class Conflict in Large Mini-Batch
A larger batch size generally yields better training performance in conventional recognition tasks. In TCP, however, a larger batch size may introduce multiple clusters sharing the same class label among the unlabelled data. Assume the classes are evenly distributed over the unlabelled clusters and that the N clusters in the unlabelled data actually belong to \(\tilde{N}\) classes; the probability that every cluster in the mini-batch \(\mathcal {B}\) has a unique class label is then \(P(l) = (1-\frac{N/\tilde{N}-1}{N})^l\), where l is the number of selected clusters. This probability decreases as the batch size increases, as shown in Fig. 5.
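As a quick numerical check (ours), the sketch below evaluates \(P(l)\) for cluster-to-class ratios of the order reported in the next paragraph:

```python
# P(l) = (1 - (N/Ñ - 1) / N) ** l: probability that all l sampled clusters
# carry distinct class labels, given N clusters covering Ñ classes.
def p_unique(ratio: float, n_clusters: int, l: int) -> float:
    return (1.0 - (ratio - 1.0) / n_clusters) ** l

print(p_unique(8.0, 158_446, 40))   # Re-ID-like setting  -> ~0.998
print(p_unique(3.0, 500_000, 40))   # face-like setting   -> ~0.9998
```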
In our experiments, the ratio is \({N}/{\tilde{N}}\simeq 8\) for person Re-ID and \({N}/{\tilde{N}}\simeq 3\) for face recognition. To guarantee the probability \(P(l) > 0.99\), the number of clusters l selected in a mini-batch should be no larger than 40. To include as many unlabelled clusters per mini-batch as possible while avoiding conflicts, we adopt the following two strategies:
Selection of Clusters – Based on the assumption that the probability of inter-class conflict decreases with the time interval between data collections, the l clusters are picked with a minimum time interval \(T_{l}\) to avoid conflicts during training. In our experiments, \(T_{l}\ge 120\) s gives good performance.
Selection of Samples – The diversity of samples taken from consecutive frames of one cluster is usually too small to aid intra-class feature learning. We therefore constrain the sample selection so that the interval between sampled frames is larger than \(T_{o}\); in our experiments we set \(T_{o}\) to 1 s. A sketch of both selection strategies is given below.
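The two selection strategies might look as follows (our illustration; `clusters` and `frames` are hypothetical lists of (id, timestamp) pairs):

```python
import random

def select_clusters(clusters, l, min_gap=120.0):
    """Pick l clusters whose collection timestamps are pairwise at least
    min_gap seconds apart (T_l >= 120 s), to lower inter-class conflict."""
    pool = list(clusters)
    random.shuffle(pool)
    picked = []
    for cid, t in pool:
        if all(abs(t - pt) >= min_gap for _, pt in picked):
            picked.append((cid, t))
        if len(picked) == l:
            break
    return picked

def select_samples(frames, o, min_gap=1.0):
    """From one cluster's (frame, timestamp) list, keep o samples spaced
    at least T_o = 1 s apart to increase intra-class diversity."""
    picked, last_t = [], float('-inf')
    for frame, t in sorted(frames, key=lambda x: x[1]):
        if t - last_t >= min_gap:
            picked.append(frame)
            last_t = t
        if len(picked) == o:
            break
    return picked
```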
Based on the aforementioned strategies, we find that only 19 out of 10,000 mini-batches on Re-ID and 7 out of 10,000 mini-batches on face recognition have duplicated identities when setting \(l = 48\) on our training dataset.
3.4 Discussion: Stability and Efficiency
We further discuss the superiority of the proposed TCP layer over other metric learning losses, such as triplet loss [2] and contrastive loss [37], which can also avoid inter-class conflict through elaborate batch selection. Both of these losses suffer from dramatic data expansion when forming sample pairs or triplets from the training set. Taking triplet loss as an example, n unlabelled samples constitute \(\frac{1}{3}n\) triplets, and the metric only constrains \(\frac{2}{3}n\) distances in each iteration, i.e. the anchor-to-negative and anchor-to-positive distances within each triplet. This exposes the triplet term to severe disturbance during training. In contrast, in the proposed TCP layer, the \(n=p+q\) samples are compared against all M anchors of the labelled data as well as the l ad hoc centroids of the unlabeled data, yielding \((M+l)\times (p+q)\) comparisons, quadratically more than other metric learning methods. This ensures a stable training process and quick convergence.
4 Experimental Settings and Implementation Details
Labeled Data and Unlabeled Data. For both person re-identification and face recognition, the training data consist of two parts: labeled data \(\mathcal {D}^{\text {L}}\) and unlabeled data \(\mathcal {D}^{\text {U}}\).
In the Re-ID experiments, following the pipeline of DGD [38] and Spindle [39], we take the combined training samples of the eight datasets described in Table 2 as \(\mathcal {D}^{\text {L}}\). Note that MARS [13] is excluded from the training set since it is an extension of Market-1501. To construct \(\mathcal {D}^{\text {U}}\), we collect videos with a total length of four hours from three different scenes with four cameras. The person clusters are obtained by the POI tracker [40] and clustered by [24] without further alignment, and those shorter than one second are removed. The unlabeled dataset, named the Person Tracker Re-Identification dataset (PT-ReID) (see Footnote 2), contains 158,446 clusters and 1,324,019 frames in total. For the ablation study, we further manually annotate PT-ReID, yielding the Labeled PT-ReID dataset (L-PT-ReID) with a total of 2,495 identities.
In the face recognition experiments, we combine the labelled MS-Celeb-1M [35] with photos collected from the internet as \(\mathcal {D}^{\text {L}}\), which in total contains \(\sim \)10M images of 1.6M identities. For \(\mathcal {D}^{\text {U}}\) we collect 11.0M face frames from surveillance videos and cluster them into 500K clusters. All faces are detected and aligned by [41].
Evaluation Benchmarks. For Re-ID, the proposed method is evaluated on six widely used public benchmarks, including the image-based Market-1501 [42], CUHK01 [43] and CUHK03 [44], and the video-based MARS [13], iLIDS-VID [45] and PRID2011 [46]. For face recognition, we evaluate on NIST IJB-C [47], which contains 138,000 face images, 11,000 face videos and 10,000 non-face images; to the best of our knowledge, it is the latest and most challenging benchmark for face verification. Note that we found more than one hundred wrong annotations in this dataset, which introduce significant confusion in the recall rate at small false positive rates (FPR \(\le \) 1e-3), so we remove these pairs in evaluation (see Footnote 3).
Evaluation Metrics. For Re-ID, the widely used Cumulative Match Curve (CMC) is adopted in both the ablation study and the comparison experiments. In addition, we use mean average precision (mAP) as another metric on the Market-1501 [42] and MARS [13] datasets. For face recognition, the receiver operating characteristic (ROC) curve is adopted as in most other works. On all datasets, we compute the cosine distance between each query image and every gallery image, and return the ranked gallery list.
Training Details. As is common practice in deep learning for visual tasks, we initialize our model with parameters pre-trained on ImageNet. Specifically, we employ ResNet-101 as the backbone in all experiments, followed by an additional fc layer after pool5 that generates 128-D features. Dropout [48] randomly drops channels with a ratio of 0.5. The input size is normalized to \(224\,\times \,224\) and the training batch size is 3,840, in which \(p=2{,}880\), \(q=960\), \(l=96\) and \(o=10\). The warm-up technique [49] is used to stabilize training with such a large batch size.
5 Ablation Study
Since the training data, network structure and data pre-processing vary from method to method, we first analyse the effectiveness of the proposed method with quantitative comparisons against different baselines in Sect. 5.1, and then visualize the feature space in Sect. 5.2. All ablation studies are conducted on Market-1501, a large-scale clean dataset with strong generalizability.
5.1 Component Analysis
Since semi-supervised learning involves two data sources, i.e. labeled data \(\mathcal {D}^{\text {L}}\) and unlabeled data \(\mathcal {D}^{\text {U}}\), the proposed TCP is compared with nine typical baseline configurations listed in Table 3. These baselines fall into two types: single-task learning with one data source and multi-task learning with multiple data sources.
The first four are single-task learning with a single data source: (1) \(\mathbf {S}^{\text {L}}\) uses only \(\mathcal {D}^{\text {L}}\), supervised by the annotated ground-truth IDs with softmax loss; (2) \(\mathbf {S}^{\text {U}}\) uses only \(\mathcal {D}^{\text {U}}\), supervised by taking the cluster IDs as pseudo ground truth with softmax loss; (3) \(\mathbf {S}^{\text {U}}_{\text {self}}\) applies self-training, a classical semi-supervised learning method, to the unlabeled data: we first train the CNN on \(\mathcal {D}^{\text {L}}\), use it to extract features of \(\mathcal {D}^{\text {U}}\), obtain pseudo ground truth with a clustering algorithm, and take this pseudo ground truth as the supervision for training on \(\mathcal {D}^{\text {U}}\); and (4) \(\mathbf {S}^{\text {U}}_{\text {labeled}}\), for which we further annotate the real ground truth of the unlabeled data and compare it with the model trained on pseudo ground truth.
The latter five are multi-task learning, three of which combine the single-task baselines above: (5) \(\mathbf {M}^{\text {U+L}}\) combines \(\mathbf {S}^{\text {L}}\) and \(\mathbf {S}^{\text {U}}\); (6) \(\mathbf {M}^{\text {U+L}}_{\text {self}}\) combines \(\mathbf {S}^{\text {L}}\) and \(\mathbf {S}^{\text {U}}_{\text {self}}\); and (7) \(\mathbf {M}^{\text {U+L}}_{\text {labeled}}\) combines \(\mathbf {S}^{\text {L}}\) and \(\mathbf {S}^{\text {U}}_{\text {labeled}}\). The last two take the annotated ground truth to supervise the branch with labeled data and compare the performance of triplet loss against our TCP on the unlabeled data: (8) \(\mathbf {M}^{\text {U+L}}_{\text {tr-loss}}\) uses triplet loss, where the triplet selection strategy also follows the Online Batch Selection described in Sect. 3.3, and (9) \(\mathbf {M}^{\text {U+L}}_{\text {TCP{}}}\) utilizes the proposed TCP, here regarded as training in an unsupervised manner.
The proposed TCP is neither single-task nor multi-task learning; instead, the labeled and unlabeled data are trained simultaneously in a semi-supervised manner. The results clearly show that both single-task and multi-task learning pull down the performance, which we analyse as follows:
Clustered Data Contain Noisy and Fake Ground Truth. Compared with the naïve baseline \(\mathbf {S}^{\text {U}}\) that directly uses cluster IDs as supervision, the self-training \(\mathbf {S}^{\text {U}}_{\text {self}}\) outperforms it by \(42\%\). Similarly, when fusing labeled data, \(\mathbf {M}^{\text {U+L}}_{\text {self}}\) is superior to \(\mathbf {M}^{\text {U+L}}\) by \(31.4\%\). This shows that (1) the source cluster data contain many fake ground-truth labels and (2) many cluster fragments cause the same identity to be assigned to different ground-truth IDs.
It Is Hard to Manually Refine Unlabelled Cluster Data. We further annotate the cluster data to obtain the real ground truth of the unlabeled data. Although \(\mathbf {S}^{\text {U}}_{\text {labeled}}\) outperforms \(\mathbf {S}^{\text {U}}\) with pseudo ground truth, again demonstrating the noise in the clusters, both \(\mathbf {S}^{\text {U}}_{\text {labeled}}\) and \(\mathbf {M}^{\text {U+L}}_{\text {labeled}}\) lose performance compared to training on the labeled data \(\mathbf {S}^{\text {L}}\). This shows that there is a significant disparity between the two source data domains, and that obtaining a clean annotation set is non-trivial due to the time gap between different clusters.
Self-training and Triplet-Loss Are Not Optimal. Both self-training \(\mathbf {M}^{\text {U+L}}_{\text {self}}\) and triplet loss \(\mathbf {M}^{\text {U+L}}_{\text {tr-loss}}\) offer ways around the problems caused by the pseudo ground truth of the clustered data, significantly outperforming the naïve combination of unlabeled and labeled data \(\mathbf {M}^{\text {U+L}}\); however, their results are still lower than ours by \(21.6\%\) and \(6.9\%\) respectively. As discussed in Sect. 3.4, the triplet loss considers only \(\frac{2}{3}n\) distances and cannot fully exploit the information in each batch, while self-training depends profoundly on the robustness of the model pre-trained on labeled data, which cannot be guaranteed, and thus does not intrinsically solve the problem.
The Superiority of TCP. By employing TCP, both the unsupervised variant \(\mathbf {M}^{\text {U+L}}_{\text {TCP{}}}\) and the semi-supervised TCP, not surprisingly, outperform all of the above baselines by a large margin. This proves the superiority of the proposed online batch selection and centroid projection mechanism, which comprehensively utilizes all labeled and unlabeled data by optimizing \((M+l)\times (p+q)\) distances.
5.2 Feature Hyperspace on Person Re-ID
The feature spaces learned on MNIST, CIFAR-100 and MS1M are discussed in Sect. 2.1. Here we examine whether the same observations and conclusions also hold on person re-identification with the proposed TCP layer, by visualizing the distributions of the mini-batches on a single GPU at different training stages. For a clear visualization, Fig. 6 shows a mini-batch with 8 labeled samples, each belonging to a distinct class, and 24 unlabeled samples from 3 clusters with 8 samples each. As the number of epochs increases, the anchors of the labeled data converge towards their corresponding sample centroids, while those of the unlabeled data remain at the centroids by construction. Once the network converges, the anchors of both labeled and unlabeled data sit at the centroid of each class, and the unlabeled data can thus be regarded as auto-annotated data that enlarge the training data span.
6 Evaluation on Seven Benchmarks
6.1 Person Re-Identification Benchmarks
We first evaluate our method on the six Re-ID benchmarks. Note that since the data pre-processing, training settings and network structures vary across state-of-the-art methods, we list recent best-performing methods in the tables for reference only. The test procedure on iLIDS-VID and PRID2011 averages 10-fold cross-validation results, whereas on MARS we use the fixed split of the official protocol [13]. In Table 4, ‘Basel.’ denotes the \(\mathbf {S}^{\text {L}}\) setting of Sect. 5. The proposed TCP, compared with a variety of recent methods, achieves the best performance on the Market-1501, CUHK03 and CUHK01 datasets. The performance can be further improved by an additional re-ranking step (Table 5).
6.2 Face Recognition Benchmarks
IJB-C [47] is currently the most challenging face recognition benchmark. Since it was released only a few months ago, few works have reported results on it. We report the true positive rates at seven levels of false positive rate (from 1e-1 to 1e-7) in Table 5, comparing the proposed TCP with the baselines described in Sect. 5. The best accuracy of existing works on the widely used LFW dataset is also reported for reference. The proposed TCP outperforms all baselines, especially the self-training one, whose training process takes more than four times as long as that of TCP.
7 Conclusion
By observing the latent space learned by the softmax loss in CNNs, we propose a semi-supervised method named TCP, which can be steadily embedded in a CNN and followed by any classification loss function. Extensive experiments and ablation studies demonstrate its superiority in exploiting the full information across labelled and unlabelled data, achieving state-of-the-art performance on six person re-identification datasets and one face recognition dataset.
Notes
1. The original MS1M dataset has one million face identities and contains some noisy samples. Here we take only the first 100,000 identities for convenience of illustration.
2. The dataset will be released.
3. The list will be made available.
References
Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708 (2014)
Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1891–1898 (2014)
Sun, Y., Liang, D., Wang, X., Tang, X.: DeepID3: face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873 (2015)
Liu, Y., Li, H., Wang, X.: Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870 (2017)
Song, G., Leng, B., Liu, Y., Hetang, C., Cai, S.: Region-based quality estimation network for large-scale person re-identification. arXiv preprint arXiv:1711.08766 (2017)
Liu, Y., Yan, J., Ouyang, W.: Quality aware network for set to set recognition. In: CVPR, vol. 2, p. 8 (2017)
Zheng, Z., Zheng, L., Yang, Y.: Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In: The IEEE International Conference on Computer Vision (ICCV), October 2017
Zhou, Z., Huang, Y., Wang, W., Wang, L., Tan, T.: See the forest for the trees: joint spatial and temporal recurrent neural networks for video-based person re-identification. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
Zhao, L., Li, X., Zhuang, Y., Wang, J.: Deeply-learned part-aligned representations for person re-identification. In: The IEEE International Conference on Computer Vision (ICCV), October 2017
Li, W., Zhu, X., Gong, S.: Person re-identification by deep joint learning of multi-loss classification. arXiv preprint arXiv:1705.04724 (2017)
Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 87–102. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_6
Zheng, L., et al.: MARS: a video benchmark for large-scale person re-identification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 868–884. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_52
Weston, J., Ratle, F., Mobahi, H., Collobert, R.: Deep learning via semi-supervised embedding. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 639–655. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_34
Lee, D.H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, ICML, vol. 3, p. 2 (2013)
Liu, X., Song, M., Tao, D., Zhou, X., Chen, C., Bu, J.: Semi-supervised coupled dictionary learning for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3550–3557 (2014)
Odena, A.: Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583 (2016)
Fan, H., Zheng, L., Yang, Y.: Unsupervised person re-identification: clustering and fine-tuning. arXiv preprint arXiv:1705.10444 (2017)
Yang, J., Parikh, D., Batra, D.: Joint unsupervised learning of deep representations and image clusters. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156 (2016)
Wang, X., et al.: Unsupervised joint mining of deep features and image labels for large-scale radiology image categorization and scene recognition. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 998–1007. IEEE (2017)
Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation (2002)
MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, Oakland, CA, USA (1967)
Gowda, K.C., Krishna, G.: Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognit. 10(2), 105–112 (1978)
Gdalyahu, Y., Weinshall, D., Werman, M.: Self-organization in vision: stochastic clustering for image segmentation, perceptual grouping, and image database organization. IEEE Trans. Pattern Anal. Mach. Intell. 23(10), 1053–1074 (2001)
Kurita, T.: An efficient agglomerative clustering algorithm using a heap. Pattern Recognit. 24(3), 205–209 (1991)
Cozman, F.G.: Semi-supervised learning of mixture models. In: ICML (2003)
Bennett, K.P.: Semi-supervised support vector machines. In: NIPS, pp. 368–374 (1999)
Liu, W., Wang, J., Chang, S.F.: Robust and scalable graph-based semi-supervised learning. Proc. IEEE 100(9), 2624–2638 (2012)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI, vol. 4., p. 12 (2017)
LeCun, Y., Cortes, C.: The MNIST database of handwritten digits (2010)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: challenge of recognizing one million celebrities in the real world. Electron. Imaging 2016(11), 1–6 (2016)
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(Nov), 2579–2605 (2008)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1735–1742. IEEE (2006)
Xiao, T., Li, H., Ouyang, W., Wang, X.: Learning deep feature representations with domain guided dropout for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1249–1258 (2016)
Zhao, H., et al.: Spindle Net: person re-identification with human body region guided feature decomposition and fusion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1077–1085 (2017)
Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., Yan, J.: POI: multiple object tracking with high performance detection and appearance feature. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 36–42. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_3
Liu, Y., Li, H., Yan, J., Wei, F., Wang, X., Tang, X.: Recurrent scale approximation for object detection in CNN. In: IEEE International Conference on Computer Vision, vol. 5 (2017)
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: a benchmark. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124 (2015)
Li, W., Zhao, R., Wang, X.: Human reidentification with transferred metric learning. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012. LNCS, vol. 7724, pp. 31–44. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37331-2_3
Li, W., Zhao, R., Xiao, T., Wang, X.: DeepReID: deep filter pairing neural network for person re-identification. In: CVPR (2014)
Wang, T., Gong, S., Zhu, X., Wang, S.: Person re-identification by video ranking. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 688–703. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_45
Hirzer, M., Beleznai, C., Roth, P.M., Bischof, H.: Person re-identification by descriptive and discriminative classification. In: Proceedings Scandinavian Conference on Image Analysis (SCIA) (2011)
The IARPA Janus Benchmark-C face challenge (IJB-C). https://www.nist.gov/programs-projects/face-challenges. Accessed 15 Mar 2018
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Goyal, P., et al.: Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Su, C., Li, J., Zhang, S., Xing, J., Gao, W., Tian, Q.: Pose-driven deep convolutional model for person re-identification. In: The IEEE International Conference on Computer Vision (ICCV), October 2017
Zhong, Z., Zheng, L., Cao, D., Li, S.: Re-ranking person re-identification with k-reciprocal encoding. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
Xu, S., Cheng, Y., Gu, K., Yang, Y., Chang, S., Zhou, P.: Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In: The IEEE International Conference on Computer Vision (ICCV), October 2017