
1 Introduction

Convolutional neural networks (CNNs) have achieved great success in the vision community, significantly improving the state of the art in classification problems such as object [11, 12, 18, 28, 33], scene [41, 42] and action [3, 16, 36] recognition. This success mainly benefits from large-scale training data [8, 26] and the end-to-end learning framework. The most commonly used CNNs perform feature learning and label prediction, mapping the input data to deep features (the output of the last hidden layer) and then to the predicted labels, as shown in Fig. 1.

In generic object, scene or action recognition, the classes of the possible testing samples are within the training set, which is also referred to as closed-set identification. Therefore, the predicted labels dominate the performance and the softmax loss is able to directly address the classification problem. In this way, the label prediction (the last fully connected layer) acts like a linear classifier, and the deeply learned features tend to be separable.

Fig. 1. The typical framework of convolutional neural networks.

For the face recognition task, the deeply learned features need to be not only separable but also discriminative. Since it is impractical to pre-collect all the possible testing identities for training, the label prediction in CNNs is not always applicable. The deeply learned features are required to be discriminative and generalized enough to identify new, unseen classes without label prediction. Discriminative power requires features with compact intra-class variations and separable inter-class differences, as shown in Fig. 1. Discriminative features can be well classified by nearest neighbor (NN) [7] or k-nearest neighbor (k-NN) [9] algorithms, which do not necessarily depend on label prediction. However, the softmax loss only encourages the separability of features. The resulting features are not sufficiently effective for face recognition.

Constructing a highly efficient loss function for discriminative feature learning in CNNs is non-trivial, because stochastic gradient descent (SGD) [19] optimizes the CNNs based on mini-batches, which cannot reflect the global distribution of the deep features very well. Due to the huge scale of the training set, it is impractical to input all the training samples in every iteration. As alternative approaches, contrastive loss [10, 29] and triplet loss [27] construct loss functions for image pairs and triplets, respectively. However, compared to the number of image samples, the number of training pairs or triplets grows dramatically, which inevitably results in slow convergence and instability. By carefully selecting the image pairs or triplets, the problem may be partially alleviated, but this significantly increases the computational complexity and makes the training procedure inconvenient.

In this paper, we propose a new loss function, namely center loss, to efficiently enhance the discriminative power of the deeply learned features in neural networks. Specifically, we learn a center (a vector with the same dimension as a feature) for the deep features of each class. In the course of training, we simultaneously update the centers and minimize the distances between the deep features and their corresponding class centers. The CNNs are trained under the joint supervision of the softmax loss and the center loss, with a hyperparameter to balance the two supervision signals. Intuitively, the softmax loss forces the deep features of different classes to stay apart, while the center loss efficiently pulls the deep features of the same class to their centers. With the joint supervision, not only are the inter-class feature differences enlarged, but the intra-class feature variations are also reduced. Hence the discriminative power of the deeply learned features can be highly enhanced. Our main contributions are summarized as follows.

  • We propose a new loss function (called center loss) to minimize the intra-class distances of the deep features. To the best of our knowledge, this is the first attempt to use such a loss function to help supervise the learning of CNNs. With the joint supervision of the center loss and the softmax loss, highly discriminative features can be obtained for robust face recognition, as supported by our experimental results.

  • We show that the proposed loss function is very easy to implement in the CNNs. Our CNN models are trainable and can be directly optimized by the standard SGD.

  • We present extensive experiments on the datasets of the MegaFace Challenge [23] (the largest public-domain face database, with 1 million faces for recognition) and set a new state of the art under the evaluation protocol of the small training set. We also verify the excellent performance of our new approach on the Labeled Faces in the Wild (LFW) [15] and YouTube Faces (YTF) [38] datasets.

2 Related Work

Face recognition via deep learning has achieved a series of breakthroughs in recent years [25, 27, 29, 30, 34, 37]. The idea of mapping a pair of face images to a distance starts from [6], which trains siamese networks to drive the similarity metric to be small for positive pairs and large for negative pairs. Hu et al. [13] learn a nonlinear transformation and yield a discriminative deep metric with a margin between positive and negative face image pairs. These approaches require image pairs as input.

Very recently, [31, 34] supervise the learning process in CNNs with a challenging identification signal (softmax loss function), which brings richer identity-related information into the deeply learned features. After that, a joint identification-verification supervision signal was adopted in [29, 37], leading to more discriminative features. [32] enhances the supervision by adding a fully connected layer and loss functions to each convolutional layer. The effectiveness of the triplet loss has been demonstrated in [21, 25, 27]. With the deep embedding, the distance between an anchor and a positive sample is minimized, while the distance between an anchor and a negative sample is maximized until the margin is met. These methods achieve state-of-the-art performance on the LFW and YTF datasets.

3 The Proposed Approach

In this section, we elaborate on our approach. We first use a toy example to intuitively show the distributions of the deeply learned features. Inspired by these distributions, we propose the center loss to improve the discriminative power of the deeply learned features, followed by some discussions.

Table 1. The CNN architecture used in the toy example, called LeNets++. Some of the convolution layers are followed by max pooling. \((5, 32)_{/1, 2} \times 2\) denotes 2 cascaded convolution layers with 32 filters of size \(5 \times 5\), where the stride and padding are 1 and 2 respectively. \(2_{/2,0}\) denotes a max-pooling layer with a grid of \(2 \times 2\), where the stride and padding are 2 and 0 respectively. In LeNets++, we use the Parametric Rectified Linear Unit (PReLU) [12] as the nonlinear unit.

3.1 A Toy Example

In this section, a toy example on the MNIST [20] dataset is presented. We modify LeNets [19] to a deeper and wider network, but reduce the output dimension of the last hidden layer to 2 (i.e., the dimension of the deep features is 2), so we can directly plot the features on a 2-D surface for visualization. More details of the network architecture are given in Table 1. The softmax loss function is presented as follows:

$$\begin{aligned} \small \mathcal {L}_S = -\sum _{i=1}^{m} \log \frac{e^{W_{y_{i}}^{T}\varvec{x}_{i}+b_{y_{i}}}}{\sum _{j=1}^{n} e^{W_{j}^{T}\varvec{x}_{i}+b_{j}}} \end{aligned}$$
(1)

In Eq. 1, \(\varvec{x}_i\in \mathbb {R}^d\) denotes the ith deep feature, belonging to the \(y_i\)th class, and d is the feature dimension. \(W_j\in \mathbb {R}^d\) denotes the jth column of the weights \(W\in \mathbb {R}^{d \times n}\) in the last fully connected layer and \(\varvec{b}\in \mathbb {R}^n\) is the bias term. The mini-batch size and the number of classes are m and n, respectively. We omit the biases to simplify the analysis (in fact, the performance difference is negligible).
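For concreteness, the following is a minimal NumPy sketch of Eq. 1 (not part of the original implementation); the names `features`, `labels`, `W` and `b` mirror \(\varvec{x}_i\), \(y_i\), W and \(\varvec{b}\) above, and the deep features are assumed to be already computed by the network.

```python
import numpy as np

def softmax_loss(features, labels, W, b):
    """Eq. 1. features: (m, d) deep features, labels: (m,) class indices,
    W: (d, n) last-layer weights, b: (n,) biases."""
    logits = features @ W + b                        # W_j^T x_i + b_j for every class j
    logits -= logits.max(axis=1, keepdims=True)      # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].sum()  # summed over the mini-batch
```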

Fig. 2. The distribution of deeply learned features on (a) the training set and (b) the test set, both under the supervision of the softmax loss, where we use the 50K/10K train/test split. The points with different colors denote features from different classes. Best viewed in color. (Color figure online)

The resulting 2-D deep features are plotted in Fig. 2 to illustrate the distribution. Since the last fully connected layer acts like a linear classifier, the deep features of different classes are distinguished by decision boundaries. From Fig. 2 we can observe that: (i) under the supervision of softmax loss, the deeply learned features are separable, and (ii) the deep features are not discriminative enough, since they still show significant intra-class variations. Consequently, it is not suitable to directly use these features for recognition.

3.2 The Center Loss

How can we develop an effective loss function to improve the discriminative power of the deeply learned features? Intuitively, minimizing the intra-class variations while keeping the features of different classes separable is the key. To this end, we propose the center loss function, as formulated in Eq. 2:

$$\begin{aligned} \small \mathcal {L}_C = \frac{1}{2}\sum _{i=1}^{m} \Vert \varvec{x}_{i}-\varvec{c}_{y_i}\Vert ^2_2 \end{aligned}$$
(2)

Here \(\varvec{c}_{y_i}\in \mathbb {R}^d\) denotes the \(y_i\)th class center of the deep features. The formulation effectively characterizes the intra-class variations. Ideally, \(\varvec{c}_{y_i}\) should be updated as the deep features change. In other words, we would need to take the entire training set into account and average the features of every class in each iteration, which is inefficient and even impractical. Therefore, the center loss cannot be used directly in this form. This is possibly the reason why such a center loss has never been used in CNNs until now.

To address this problem, we make two necessary modifications. First, instead of updating the centers with respect to the entire training set, we perform the update based on mini-batches. In each iteration, the centers are computed by averaging the features of the corresponding classes (in this case, some of the centers may not be updated). Second, to avoid large perturbations caused by a few mislabelled samples, we use a scalar \(\alpha \) to control the learning rate of the centers. The gradient of \(\mathcal {L}_C\) with respect to \(\varvec{x}_i\) and the update equation of \(\varvec{c}_{y_i}\) are computed as:

$$\begin{aligned} \small \frac{\partial \mathcal {L}_C}{\partial \varvec{x}_i} = \varvec{x}_i - \varvec{c}_{y_i} \end{aligned}$$
(3)
$$\begin{aligned} \small \varDelta \varvec{c}_{j} = \frac{\sum _{i=1}^{m} \delta (y_i=j)\cdot (\varvec{c}_{j} - \varvec{x}_i)}{1 + \sum _{i=1}^{m}\delta (y_i=j)} \end{aligned}$$
(4)

where \(\delta (condition)=1\) if the condition is satisfied and \(\delta (condition)=0\) otherwise, and \(\alpha \) is restricted to [0, 1]. We adopt the joint supervision of the softmax loss and the center loss to train the CNNs for discriminative feature learning. The formulation is given in Eq. 5.

$$\begin{aligned} \small \begin{aligned} \mathcal {L}&= \mathcal {L}_S + \lambda \mathcal {L}_C \\&= -\sum _{i=1}^{m} \log \frac{e^{W_{y_{i}}^{T}\varvec{x}_{i}+b_{y_{i}}}}{\sum _{j=1}^{n} e^{W_{j}^{T}\varvec{x}_{i}+b_{j}}} + \frac{\lambda }{2} \sum _{i=1}^{m} \Vert \varvec{x}_{i}-\varvec{c}_{y_i}\Vert ^2_2 \end{aligned} \end{aligned}$$
(5)

Clearly, the CNNs supervised by the center loss are trainable and can be optimized by standard SGD. A scalar \(\lambda \) is used to balance the two loss functions. The conventional softmax loss can be considered as a special case of this joint supervision if \(\lambda \) is set to 0. In Algorithm 1, we summarize the learning details of the CNNs with joint supervision.

Algorithm 1. The learning details of the CNNs with joint supervision.
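As an illustrative sketch of Algorithm 1 (this is not the authors' Caffe implementation), the following NumPy function performs the center-loss part of one training iteration: it computes \(\mathcal {L}_C\) (Eq. 2) and its gradient with respect to the features (Eq. 3), and updates the class centers in place with Eq. 4, scaled by \(\alpha \). The softmax branch and the network backward pass are assumed to be handled by the usual SGD machinery; the total feature gradient is then \(\partial \mathcal {L}_S/\partial \varvec{x}_i + \lambda (\varvec{x}_i - \varvec{c}_{y_i})\), as in Eq. 5.

```python
import numpy as np

def center_loss_step(features, labels, centers, alpha=0.5):
    """features: (m, d) deep features x_i, labels: (m,) class indices y_i,
    centers: (n, d) class centers c_j (updated in place)."""
    diff = features - centers[labels]                 # x_i - c_{y_i}
    loss = 0.5 * (diff ** 2).sum()                    # Eq. 2
    grad_features = diff                              # Eq. 3

    for j in np.unique(labels):                       # only centers observed in the mini-batch move
        mask = labels == j
        delta_c = (centers[j] - features[mask]).sum(axis=0) / (1.0 + mask.sum())  # Eq. 4
        centers[j] -= alpha * delta_c                 # c_j <- c_j - alpha * delta_c_j
    return loss, grad_features
```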

We also conduct experiments to illustrate how \(\lambda \) influences the distribution. Figure 3 shows that different values of \(\lambda \) lead to different deep feature distributions. With a proper \(\lambda \), the discriminative power of the deep features can be significantly enhanced. Moreover, the features remain discriminative across a wide range of \(\lambda \). Therefore, the joint supervision benefits the discriminative power of the deeply learned features, which is crucial for face recognition.

Fig. 3. The distribution of deeply learned features under the joint supervision of the softmax loss and the center loss. The points with different colors denote features from different classes. Different \(\lambda \) lead to different deep feature distributions (\(\alpha =0.5\)). The white dots (\(\varvec{c}_0\), \(\varvec{c}_1\), ..., \(\varvec{c}_9\)) denote the 10 class centers of the deep features. Best viewed in color. (Color figure online)

3.3 Discussion

  • The necessity of joint supervision. If we only use the softmax loss as the supervision signal, the resulting deeply learned features contain large intra-class variations. On the other hand, if we only supervise the CNNs by the center loss, the deeply learned features and centers will degrade to zeros (at this point, the center loss is trivially small). Simply using either of them cannot achieve discriminative feature learning, so it is necessary to combine them to jointly supervise the CNNs, as confirmed by our experiments.

  • Compared to contrastive loss and triplet loss. Recently, the contrastive loss [29, 37] and the triplet loss [27] have also been proposed to enhance the discriminative power of the deeply learned face features. However, both suffer from dramatic data expansion when constituting sample pairs or triplets from the training set. Our center loss has the same input requirement as the softmax loss and needs no complex recombination of the training samples. Consequently, the supervised learning of our CNNs is more efficient and easier to implement. Moreover, our loss function targets the learning objective of intra-class compactness more directly, which is very beneficial to discriminative feature learning.

4 Experiments

The necessary implementation details are given in Sect. 4.1. Then we investigate the sensitivity of the parameters \(\lambda \) and \(\alpha \) in Sect. 4.2. In Sects. 4.3 and 4.4, extensive experiments are conducted on several public-domain face datasets (LFW [15], YTF [38] and the MegaFace Challenge [23]) to verify the effectiveness of the proposed approach.

Fig. 4. The CNN architecture used for the face recognition experiments. Joint supervision is adopted. The filter sizes in both the convolution and local convolution layers are \(3\times 3\) with stride 1, followed by PReLU [12] nonlinear units. Weights in the three local convolution layers are locally shared in regions of \(4\times 4\), \(2\times 2\) and \(1\times 1\) respectively. The number of feature maps is 128 for the convolution layers and 256 for the local convolution layers. The max-pooling grid is \(2\times 2\) and the stride is 2. The outputs of the 4th pooling layer and the 3rd local convolution layer are concatenated as the input of the 1st fully connected layer. The output dimension of the fully connected layer is 512. Best viewed in color. (Color figure online)

4.1 Implementation Details

Preprocessing. All the faces in the images and their landmarks are detected by the recently proposed algorithms [40]. We use 5 landmarks (two eyes, nose and mouth corners) for the similarity transformation. When the detection fails, we simply discard the image if it is in the training set, but use the provided landmarks if it is a testing image. The faces are cropped to \(112\times 96\) RGB images. Following a previous convention, each pixel (in [0, 255]) in the RGB images is normalized by subtracting 127.5 and then dividing by 128.
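As a small illustration of the pixel normalization above (the detection, landmarking and similarity-transform steps are performed by the cited algorithm [40] and are not reproduced here):

```python
import numpy as np

def normalize_face(cropped_rgb):
    """cropped_rgb: (112, 96, 3) uint8 face crop with pixel values in [0, 255]."""
    return (cropped_rgb.astype(np.float32) - 127.5) / 128.0
```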

Training data. We use web-collected training data, including CASIA-WebFace [39], CACD2000 [4] and Celebrity+ [22]. After removing the images with identities appearing in the testing datasets, this amounts to roughly 0.7M images of 17,189 unique persons. In Sect. 4.4, we use only 0.49M training images, following the protocol of the small training set. The images are horizontally flipped for data augmentation. Compared to [27] (200M), [34] (4M) and [25] (2M), this is a small-scale training set.

Detailed settings in CNNs. We implement the CNN model using the Caffe [17] library with our modifications. All the CNN models in this section share the same architecture; the details are given in Fig. 4. For fair comparison, we train three kinds of models: under the supervision of the softmax loss (model A), the softmax loss and the contrastive loss (model B), and the softmax loss and the center loss (model C). These models are trained with a batch size of 256 on two GPUs (TitanX). For model A and model C, the learning rate starts at 0.1 and is divided by 10 at 16K and 24K iterations. A complete training finishes at 28K iterations and takes roughly 14 h. Model B converges more slowly, so we initialize the learning rate to 0.1 and switch it at 24K and 36K iterations. The total number of iterations is 42K, taking 22 h.
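For reference, the step schedules described above can be written as a small helper (only a sketch; the actual training uses Caffe's solver settings):

```python
def learning_rate(iteration, base_lr=0.1, steps=(16000, 24000), gamma=0.1):
    """Models A and C: start at 0.1, divide by 10 at 16K and 24K iterations.
    For model B, use steps=(24000, 36000)."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma
    return lr
```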

Detailed settings in testing. The deep features are taken from the output of the first FC layer. We extract the features of each image and its horizontally flipped copy, and concatenate them as the representation. The score is computed as the cosine distance of two features after PCA. Nearest neighbor [7] and threshold comparison are used for the identification and verification tasks. Note that we use only a single model for all the tests.
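A sketch of the test-time scoring described above, assuming the 512-D deep features of an image and its horizontally flipped copy have already been extracted, and that a PCA mean and projection matrix have been estimated beforehand (the helper names below are illustrative, not from the original code):

```python
import numpy as np

def face_representation(feat, feat_flipped):
    """Concatenate the features of an image and its horizontally flipped copy (512 + 512 = 1024-D)."""
    return np.concatenate([feat, feat_flipped])

def similarity(rep_a, rep_b, pca_mean, pca_components):
    """Cosine similarity of two representations after PCA projection."""
    a = (rep_a - pca_mean) @ pca_components.T
    b = (rep_b - pca_mean) @ pca_components.T
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Verification then compares this score against a threshold, while identification ranks the gallery images by the score (nearest neighbor).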

4.2 Experiments on the Parameter \(\lambda \) and \(\alpha \)

The hyperparameter \(\lambda \) dominates the intra-class variations and \(\alpha \) controls the learning rate of the centers c in model C. Both of them are essential to our model, so we conduct two experiments to investigate the sensitivity of the two parameters.

Fig. 5. Face verification accuracies on the LFW dataset, achieved by (a) models with different \(\lambda \) and fixed \(\alpha =0.5\), and (b) models with different \(\alpha \) and fixed \(\lambda =0.003\).

In the first experiment, we fix \(\alpha \) to 0.5 and vary \(\lambda \) from 0 to 0.1 to learn different models. The verification accuracies of these models on LFW dataset are shown in Fig. 5. It is very clear that simply using the softmax loss (in this case \(\lambda \) is 0) is not a good choice, leading to poor verification performance. Properly choosing the value of \(\lambda \) can improve the verification accuracy of the deeply learned features. We also observe that the verification performance of our model remains largely stable across a wide range of \(\lambda \). In the second experiment, we fix \(\lambda =0.003\) and vary \(\alpha \) from 0.01 to 1 to learn different models. The verification accuracies of these models on LFW are illustrated in Fig. 5. Likewise, the verification performance of our model remains largely stable across a wide range of \(\alpha \).

4.3 Experiments on the LFW and YTF Datasets

In this part, we evaluate our single model on two famous face recognition benchmarks in unconstrained environments, the LFW and YTF datasets. They are excellent benchmarks for face recognition in images and videos, respectively. Some examples are illustrated in Fig. 6. Our model is trained on the 0.7M outside data, with no identities overlapping with LFW or YTF. In this section, we fix \(\lambda \) to 0.003 and \(\alpha \) to 0.5 for model C.

The LFW dataset contains 13,233 web-collected images from 5,749 different identities, with large variations in pose, expression and illumination. Following the standard protocol of unrestricted with labeled outside data [14], we test on 6,000 face pairs and report the results in Table 2.

The YTF dataset consists of 3,425 videos of 1,595 different people, with an average of 2.15 videos per person. The clip durations vary from 48 frames to 6,070 frames, with an average length of 181.3 frames. Again, we follow the unrestricted with labeled outside data protocol and report the results on 5,000 video pairs in Table 2.

Fig. 6. Some face images and videos from the LFW and YTF datasets. The face image pairs in green frames are positive pairs (the same person), while the ones in red frames are negative pairs. The white bounding box in each image indicates the face used for testing.

Table 2. Verification performance of different methods on LFW and YTF datasets

From the results in Table 2, we have the following observations. First, model C (jointly supervised by the softmax loss and the center loss) beats the baseline (model A, supervised by the softmax loss only) by a significant margin, improving the performance from 97.37 % to 99.28 % on LFW and from 91.1 % to 94.9 % on YTF. This shows that the joint supervision can notably enhance the discriminative power of the deeply learned features, demonstrating the effectiveness of the center loss. Second, compared to model B (supervised by the combination of the softmax loss and the contrastive loss), model C achieves better performance (99.28 % vs. 99.10 % on LFW and 94.9 % vs. 93.8 % on YTF). This shows the advantage of the center loss over the contrastive loss in the designed CNNs. Last, compared to the state-of-the-art results on the two databases, the results of the proposed model C (with much less training data and a simpler network architecture) are consistently among the top-ranked approaches, outperforming most of the existing results in Table 2. This shows the advantage of the proposed CNNs.

4.4 Experiments on the Dataset of MegaFace Challenge

The MegaFace dataset was recently released as a testing benchmark. It is very challenging and aims to evaluate the performance of face recognition algorithms at the million scale of distractors (people who are not in the testing set). MegaFace includes a gallery set and a probe set. The gallery set consists of more than 1 million images from 690K different individuals, as a subset of Flickr photos [35] from Yahoo. The probe sets used in this challenge are two existing databases: FaceScrub [24] and FGNet [1]. FaceScrub is a publicly available dataset containing 100K photos of 530 unique individuals (55,742 images of males and 52,076 images of females). The possible bias is reduced by the sufficient number of samples per identity. FGNet is a face aging dataset, with 1,002 images from 82 identities. Each identity has multiple face images at different ages (ranging from 0 to 69).

There are several testing scenarios (identification, verification and pose invariance) under two protocols (large or small training set). The training set is defined as small if it contains fewer than 0.5M images and 20K subjects. Following the protocol of the small training set, we reduce the size of the training set to 0.49M images while keeping the number of identities unchanged (i.e., 17,189 subjects). The images overlapping with the FaceScrub dataset are discarded. For fair comparison, we also train three kinds of CNN models on the small training set under the different supervision signals. The resulting models are called model A-, model B- and model C-, respectively. Following the same settings as in Sect. 4.3, \(\lambda \) is 0.003 and \(\alpha \) is 0.5 in model C-. We conduct the experiments with the provided code [23], which only tests our algorithm on one of the three galleries (Set 1).

Fig. 7. Some example face images in the MegaFace dataset, including the probe set and the gallery. The gallery consists of at least one correct image and millions of distractors. Because of the large intra-class variations of each subject and the variety of distractors, the identification and verification tasks become very challenging.

Face Identification. Face identification aims to match a given probe image to the images of the same person in the gallery. In this task, we need to compute the similarity between each given probe face image and the gallery, which includes at least one image with the same identity as the probe. Besides, the gallery contains different scales of distractors, from 10 to 1 million, leading to an increasing challenge in testing. More details can be found in [23]. In the face identification experiments, we present the results as Cumulative Match Characteristics (CMC) curves, which reveal the probability that a correct gallery image is ranked within the top K. The results are shown in Fig. 8.
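For clarity, a rough sketch of how a CMC curve can be computed from a probe-gallery similarity matrix is given below; the reported results are produced with the official MegaFace code [23], so this is only an illustration.

```python
import numpy as np

def cmc_curve(sim, gallery_labels, probe_labels, max_rank=10):
    """sim: (num_probes, num_gallery) similarity scores (higher is more similar).
    Returns an array whose (k-1)-th entry is the fraction of probes whose
    correct match is ranked within the top k."""
    hits = np.zeros(max_rank)
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                                  # most similar gallery items first
        correct = np.nonzero(gallery_labels[order] == probe_labels[i])[0]
        if correct.size and correct[0] < max_rank:
            hits[correct[0]:] += 1                                   # counted at its rank and all larger ranks
    return hits / sim.shape[0]
```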

Fig. 8. CMC curves of different methods (under the protocol of the small training set) with (a) 1M and (b) 10K distractors on Set 1. The results of the other methods are provided by the MegaFace team.

Face Verification. For face verification, the algorithm should decide whether a given pair of images belongs to the same person. 4 billion negative pairs between the probe and gallery sets are produced. We compute the True Accept Rate (TAR) and False Accept Rate (FAR) and plot the Receiver Operating Characteristic (ROC) curves of different methods in Fig. 9.
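A small sketch of how the TAR at a fixed FAR (e.g., \(10^{-6}\)) can be estimated from genuine-pair and impostor-pair scores; again, the reported numbers come from the official evaluation, and this function is only illustrative.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, target_far=1e-6):
    """Choose a threshold so that at most target_far of the impostor pairs are
    accepted, then report the fraction of genuine pairs accepted at that threshold."""
    impostor_sorted = np.sort(impostor_scores)
    k = int(np.ceil(len(impostor_sorted) * (1.0 - target_far)))
    threshold = impostor_sorted[min(k, len(impostor_sorted) - 1)]
    return (genuine_scores > threshold).mean()
```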

Fig. 9. ROC curves of different methods (under the protocol of the small training set) with (a) 1M and (b) 10K distractors on Set 1. The results of the other methods are provided by the MegaFace team.

We compare our method against many existing ones, including (i) LBP [2] and JointBayes [5], (ii) our baseline deep models (model A- and model B-), and (iii) deep models submitted by other groups. As can be seen from Figs. 8 and 9, the hand-crafted features and the shallow model perform poorly; their accuracies drop sharply as the number of distractors increases. In addition, the methods based on deep learning perform better than the traditional ones. However, there is still much room for performance improvement. Finally, with the joint supervision of the softmax loss and the center loss, model C- achieves the best results, not only surpassing model A- and model B- by a clear margin but also significantly outperforming the other published methods.

To meet practical demands, face recognition models should achieve high performance against millions of distractors. In this case, only the rank-1 identification rate with at least 1M distractors and the verification rate at a low false accept rate (e.g., \(10^{-6}\)) are truly meaningful [23]. We report the experimental results of the different methods in Tables 3 and 4.

Table 3. Identification rates of different methods on MegaFace with 1M distractors.
Table 4. Verification TAR of different methods at \(10^{-6}\) FAR on MegaFace with 1M distractors.

From these results we have the following observations. First, not surprisingly, model C- consistently outperforms model A- and model B- by a significant margin in both the face identification and verification tasks, confirming the advantage of the designed loss function. Second, under the evaluation protocol of the small training set, the proposed model C- achieves the best results in both tasks, outperforming the 2nd place by 5.97 % on face identification and 10.15 % on face verification. Moreover, it is worth noting that model C- even surpasses some models trained with the large training set (e.g., Beijing Facecall Co.). Last, the models from Google and NTechLAB achieve the best performance under the protocol of the large training set. Note that their private training sets (500M for Google and 18M for NTechLAB) are much larger than ours (0.49M).

5 Conclusions

In this paper, we have proposed a new loss function, referred to as center loss. By combining the center loss with the softmax loss to jointly supervise the learning of CNNs, the discriminative power of the deeply learned features can be highly enhanced for robust face recognition. Extensive experiments on several large-scale face benchmarks have convincingly demonstrated the effectiveness of the proposed approach.