
1 Introduction

In recent years, the performance of ordinary face recognition algorithms has almost reached saturation. However, extreme lighting conditions, such as dark environments, severely limit the application of face recognition in daily life. Near-infrared face images maintain high image quality regardless of lighting conditions, making face recognition possible in the dark. Matching near-infrared face images to visible light face images, i.e., heterogeneous face recognition, is therefore important in realistic face recognition systems [7, 16]. Compared with canonical face recognition, it faces the following challenges:

Fig. 1.

Comparison of the series structure and the parallel structure. In the classic series structure, the low-level layers of the network (S-stream) are pre-trained and their parameters are fixed, while the high-level layers (T-stream) are trainable and use only the high-level features of the S-stream as input. In our parallel structure, one branch of the network (S-stream) is pre-trained and its parameters are fixed; the other branch (T-stream) absorbs multi-scale feature maps extracted by the S-stream and transfers the face embedding features.

Insufficient Data. The NIR-VIS matching task requires each subject to have both VIS and NIR images. Collecting such data is difficult, and the existing NIR-VIS face databases [12, 14] are small compared with ordinary face databases, which are composed of visible light facial images collected by web crawlers.

Cross-Domain Recognition. Cross-domain recognition is the fundamental difficulty of heterogeneous face recognition [12, 14]. Images from different domains tend to vary greatly in appearance, attributes, and distribution. To match images from diverse domains, the algorithm must ignore domain-specific attributes and extract domain-independent features.

Considering the above challenges, we argue that transfer learning and cross-domain matching are the two key issues. First, we believe that a parallel structure that uses multi-scale feature maps for transfer learning is the better choice. Since deep learning is data-hungry, training a deep model with insufficient samples is prone to overfitting. The common practice is therefore to pre-train the model on a large-scale VIS database (source domain) and then fine-tune the last few layers on a small-scale NIR database (target domain) [9, 10]. This is a series structure that transfers only high-level features. To facilitate comparison with our parallel structure, we refer to the frozen low-level layers in the series structure as the S-stream and the tunable high-level layers as the T-stream. In a series structure, the T-stream uses only the high-level features generated by the S-stream to transfer the face embedding features, which discards a considerable amount of discriminative information and is therefore not sufficient. In addition, it is difficult to determine how many layers to fine-tune. We instead propose a parallel two-stream architecture: the S-stream is pre-trained on a large-scale VIS database to generate multi-scale feature maps, and the T-stream, which contains fewer parameters to avoid overfitting, transfers them. The comparison of the series and parallel structures is illustrated in Fig. 1.

Second, we believe that both NIR and VIS features should be transferred to a new, unique, modality agnostic feature space. In general transfer learning tasks, we tend to focus on model performance only in the target domain and ignore the source domain. For NIR-to-VIS cross-domain face recognition, we must attend to both domains and transfer the discriminative features of each into a single space that eliminates the effect of modality. With mixed NIR-VIS images, i.e., each subject having both NIR and VIS images, the classification loss automatically optimizes the features of the two domains into the modality agnostic feature space.

The main contributions of our work are summarized as follows:

  • We propose a parallel architecture consisting of two network streams to achieve cross-domain identification. Using multi-scale feature maps from the S-stream, the T-stream transfers face embeddings of different domains into a modality agnostic feature space.

  • We optimize the loss selection for heterogeneous face recognition and use a margin-based loss to help the model learn more expressive representations.

  • The CASIA NIR-VIS 2.0 Face database [8] and MegaFace [6] are combined to form a more challenging NIR-to-VIS MegaFace scenario.

2 Related Work

Face Recognition. FaceNet [15] proposes the triplet loss, which minimizes the distance between an anchor and a positive sample while maximizing the distance between the anchor and a negative sample, and is an effective metric learning method. SphereFace [11], NormFace [19], CosFace [18, 20], and ArcFace [2] improve the softmax loss used for classification to accommodate metric learning and currently achieve the state of the art in face recognition. Our approach also benefits from these improvements.

Heterogeneous Face Recognition. IDR [4] divides the high-level layer into two orthogonal subspaces that contain modality-invariant identity information and modality-variant spectrum information, respectively. W-CNN [5] minimizes the Wasserstein distance between the NIR and VIS distributions to obtain an invariant deep feature representation of heterogeneous face images. DVR [21] makes use of a Disentangled Variational Representation (DVR) for cross-modal matching. All of these methods pre-train the model on large databases and keep the parameters of the low-level layers fixed, i.e., they are series-structure-based transfer learning methods.

PTU [24] argues that parameter transferability differs across domains and networks and should not simply be categorized as random, fine-tuned, or frozen, and it proposes a parameter transfer unit to address this limitation. In our method, feature fusion units connect the S-stream, which preserves the discriminative ability learned on large databases, with the T-stream, which performs the transfer; this is similar in spirit to PTU. Moreover, the feature fusion units use the multi-scale feature maps from the S-stream to help the T-stream learn expressive representations.

3 Method

In this section, we first present the details of the proposed architecture (Sect. 3.1) and explain how we transfer both NIR and VIS features to a unique space (Sect. 3.2). Subsequently, we briefly introduce the margin-based loss and explain its importance for representation learning in heterogeneous face recognition (Sect. 3.3). Finally, we introduce the more challenging NIR-to-VIS MegaFace scenario (Sect. 3.4).

Fig. 2.

An overview of the proposed PST architecture. The proposed architecture consists of two parallel network streams, i.e., a source stream (S-stream) and a transfer stream (T-stream). During training, we first pre-train the S-stream on large-scale VIS data. Its parameters are then fixed, and we train the T-stream with mixed NIR and VIS data. Intermediate feature maps of the S-stream, at multiple resolutions, are absorbed by feature fusion units (FFUs) and fused with the corresponding feature maps of the T-stream. Finally, a margin-based classification loss is applied on top of the fused features.

3.1 Architecture

The overall structure of PST is illustrated in Fig. 2. It is a parallel two-stream architecture consisting of a source stream (S-stream), a transfer stream (T-stream), and feature fusion units (FFUs). The S-stream adopts a ResNet-like network structure; the T-stream is a lite version of the S-stream. Convn.x (where \(\mathrm{n}=1,2,3,4\)) in Fig. 2 denotes units that contain several convolution layers, activation layers, and residual units. FFUs absorb multi-scale intermediate feature maps of the S-stream and fuse them with the corresponding feature maps of the T-stream. Since ResNet [3] produces feature maps at four scales, we use four feature fusion units to connect the S-stream and the T-stream so that feature maps of different resolutions can be transferred. A feature fusion unit consists of several convolution layers and activation layers and is formulated as:

$$\begin{aligned} f=FFU(f_s,f_t)=H(Concat(f_s,f_t)) \end{aligned}$$
(1)

where \(f_s\) and \(f_t\) are the corresponding intermediate feature maps of the S-stream and T-stream, respectively, and f is the result of the feature transfer. These three tensors have exactly the same shape. \(Concat(\cdot )\) denotes a concatenation operation and \(H(\cdot )\) denotes a residual unit composed of two depth-wise separable convolutions and activation layers.
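To make the fusion operation concrete, the following is a minimal PyTorch sketch of a feature fusion unit consistent with Eq. (1), assuming concatenation along the channel dimension; the 1x1 shortcut projection and all layer names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depth-wise convolution followed by a 1x1 point-wise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU(out_ch)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class FeatureFusionUnit(nn.Module):
    """Fuses S-stream and T-stream feature maps as in Eq. (1): f = H(Concat(f_s, f_t))."""
    def __init__(self, channels):
        super().__init__()
        # H(.): residual unit built from two depth-wise separable conv + activation layers.
        self.h = nn.Sequential(
            DepthwiseSeparableConv(2 * channels, channels),
            DepthwiseSeparableConv(channels, channels),
        )
        # 1x1 projection so the residual shortcut matches the output shape (assumed detail).
        self.shortcut = nn.Conv2d(2 * channels, channels, 1, bias=False)

    def forward(self, f_s, f_t):
        x = torch.cat([f_s, f_t], dim=1)        # Concat(f_s, f_t) along channels
        return self.h(x) + self.shortcut(x)     # residual fusion; same shape as f_s and f_t
```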

After the S-stream is pre-trained on a large-scale VIS database and its parameters are fixed, we feed the original NIR or VIS image into the T-stream and the corresponding multi-scale intermediate feature maps extracted by the S-stream into the FFUs. While the T-stream generates feature maps of an NIR or VIS image, it absorbs the corresponding feature maps generated by the S-stream through the FFUs to form transferred feature maps. The top-level transferred features form the face embeddings, which are optimized by the margin-based classification loss.
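The overall wiring of the two streams can be summarized by the following sketch, assuming both backbones expose their four ResNet-style stages under the illustrative attribute names stage1 to stage4 and that FeatureFusionUnit is defined as above; it is a simplified illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PSTSketch(nn.Module):
    """Parallel S-stream / T-stream sketch; stage names and the embedding head are assumptions."""
    def __init__(self, s_stream, t_stream, ffus, embedding_head):
        super().__init__()
        self.s_stream, self.t_stream = s_stream, t_stream
        self.ffus = nn.ModuleList(ffus)          # four FeatureFusionUnits, one per resolution
        self.embedding_head = embedding_head     # e.g. pooling + FC to a 512-d embedding
        # The S-stream is pre-trained on VIS data and kept fixed.
        for p in self.s_stream.parameters():
            p.requires_grad = False

    def forward(self, x):
        f_s, f_t = x, x
        for i in range(4):
            with torch.no_grad():                              # frozen source stream
                f_s = getattr(self.s_stream, f"stage{i + 1}")(f_s)
            f_t = getattr(self.t_stream, f"stage{i + 1}")(f_t)
            f_t = self.ffus[i](f_s, f_t)                       # fuse at each resolution
        return self.embedding_head(f_t)                        # transferred face embedding
```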

Fig. 3.

Illustration of the transfer target. Left. Vanilla transfer: the model transfers only the NIR features while retaining the VIS features extracted by the S-stream. Right. Modality agnostic transfer: the model transfers both the VIS and NIR features to a new modality agnostic feature space.

3.2 Modality Agnostic Transfer

Since the S-stream in our model is pre-trained on large-scale VIS data and its parameters remain fixed, the model can already extract discriminative VIS-domain features with the S-stream alone. The vanilla transfer approach (left of Fig. 3) keeps the VIS feature space unchanged (using the features extracted only by the S-stream) and transfers the NIR features into it. However, forcing only the NIR features into the VIS feature space causes the model to lose focus on the VIS features. The key to cross-domain identification is to attend to both the source and target domains; it is not appropriate to emphasize either side. We therefore argue that the model should perform "modality agnostic transfer" on both NIR and VIS data (right of Fig. 3): it indiscriminately transfers the features of both VIS and NIR data into a new, unique, modality agnostic feature space. Modality agnostic transfer ensures that the model balances the source and target domains. The features of the two domains are transferred simultaneously to a new unique feature space, where features of the same subject from different domains are optimized to be closer. Our empirical study (Subsect. 4.2) further supports this.
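The practical difference between the two transfer targets can be sketched as a small training-step function; here pst stands for the parallel model sketched above and classifier for a margin-based classification head that returns the loss, and all names are illustrative.

```python
import torch

def training_step(pst, s_stream_embed, classifier,
                  nir_imgs, vis_imgs, nir_labels, vis_labels,
                  modality_agnostic=True):
    """One step on a mixed NIR-VIS batch (function and argument names are illustrative)."""
    images = torch.cat([nir_imgs, vis_imgs], dim=0)
    labels = torch.cat([nir_labels, vis_labels], dim=0)

    if modality_agnostic:
        # Both domains pass through T-stream + FFUs into one new feature space.
        embeddings = pst(images)
    else:
        # Vanilla transfer: only NIR features are transferred; VIS features stay in the
        # space of the frozen S-stream (s_stream_embed maps images to S-stream embeddings).
        embeddings = torch.cat([pst(nir_imgs), s_stream_embed(vis_imgs)], dim=0)

    return classifier(embeddings, labels)   # margin-based classification loss
```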

3.3 Margin-Based Loss

Recently, many works [2, 11, 18, 19, 20] have improved the cross-entropy loss function to accommodate metric learning, which significantly enhances face recognition performance. The cross-entropy loss is formulated as follows:

$$\begin{aligned} L=-\log \frac{e^{s \cdot score(\theta _y)}}{\sum _{j=1}^{n} e^{s \cdot score(\theta _j)}} \end{aligned}$$
(2)

where \(score(\theta _y)\) denotes the category score of the ground-truth class y, n is the number of classes, and s is a scale factor. The improvements mainly consist of introducing a margin m into the category score \(score(\theta _y)\), in one of the following three main forms:

$$\begin{aligned} score(\theta _y)= & {} cosine(m \cdot \theta _y), m>1 \end{aligned}$$
(3)
$$\begin{aligned} score(\theta _y)= & {} cosine(\theta _y)-m, m>0 \end{aligned}$$
(4)
$$\begin{aligned} score(\theta _y)= & {} cosine(\theta _y+m), m>0 \end{aligned}$$
(5)

They are proposed in SphereFace [11], CosFace [20], and ArcFace [2], respectively.

Improvements in loss design have been thoroughly validated on large databases, but few works apply them to heterogeneous face recognition. Large-scale face databases contain a very rich set of subject categories, whereas the CASIA NIR-VIS 2.0 database [8] has fewer than 400 subjects in its training set. The margin-based losses described above effectively keep the feature distribution of each class compact. Small databases with few categories often require larger margins than large databases (Subsect. 4.2).
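As a concrete example, below is a minimal PyTorch sketch of the additive angular margin formulation in Eq. (5); the scale s, margin m, and the numerical clamping are illustrative choices rather than the exact settings of our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginLoss(nn.Module):
    """Additive angular margin softmax (Eq. (5)): score(theta_y) = cos(theta_y + m)."""
    def __init__(self, embedding_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the ground-truth class.
        target = F.one_hot(labels, num_classes=self.weight.size(0)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)
```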

3.4 NIR-to-VIS MegaFace Scenario

Both the CASIA NIR-VIS 2.0 database and MegaFace define evaluation protocols for face identification [6, 8]. CASIA evaluates how well algorithms match near-infrared images against visible images: the probe set consists of near-infrared images and the gallery set contains visible images. The probe set of the MegaFace evaluation is even smaller than that of CASIA, but there are as many as a million distractors in the gallery set, which makes the task much more challenging. We add MegaFace's distractors to CASIA's gallery set to increase the difficulty of the NIR-to-VIS task. Specific evaluation details are given in Subsect. 4.3.

4 Experiments

Our approach adopts a two-step training strategy. We first pre-train the S-stream on a large-scale VIS database and keep its parameters fixed. We then train the T-stream and the FFUs with mixed NIR-VIS images and the corresponding multi-scale feature maps extracted by the S-stream. All experiments are implemented in PyTorch [13].
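The second step can be sketched as follows, reusing the illustrative pst and ArcMarginLoss modules from Sect. 3; the optimizer type, hyperparameters, and loader name are assumptions rather than the exact experimental settings.

```python
import itertools
import torch

def train_step_two(pst, arc_margin, mixed_nir_vis_loader, epochs=20):
    """Second training step (sketch): the S-stream inside `pst` is already pre-trained and
    frozen; only the T-stream, FFUs, embedding head, and margin head receive gradients."""
    trainable = itertools.chain(
        pst.t_stream.parameters(), pst.ffus.parameters(),
        pst.embedding_head.parameters(), arc_margin.parameters(),
    )
    optimizer = torch.optim.SGD(trainable, lr=0.01, momentum=0.9, weight_decay=5e-4)

    for _ in range(epochs):
        for images, labels in mixed_nir_vis_loader:    # mixed NIR-VIS mini-batches
            optimizer.zero_grad()
            loss = arc_margin(pst(images), labels)     # modality agnostic embedding + margin loss
            loss.backward()
            optimizer.step()
```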

4.1 Pre-train S-stream with Large-Scale VIS Data

As mentioned in Sect. 1, pre-training the model is a common and essential step. Our implementation is similar to SphereFace [11], CosFace [20], and ArcFace [2]. We select a ResNet50-like backbone as in ArcFace. The network outputs a 512-dimensional embedding. WebFace [23] and VggFace2 [1] are used to pre-train the S-stream.

4.2 Train T-stream and FFUs with Mixed NIR-VIS Data

To the best of our knowledge, the CASIA NIR-VIS 2.0 Face Database [8] is the largest publicly accessible near-infrared face database designed for face recognition. All of our experiments are conducted on this database. It contains 725 subjects in total, each with 1–22 visible light images and 5–50 near-infrared images. We follow the 10-fold protocol in View 2 of the database. Each training fold contains approximately 8,600 NIR and VIS images from around 360 subjects. Each testing fold consists of a probe set and a gallery set: the probe set contains over 6,100 NIR images of the remaining 358 subjects, and the gallery set contains exactly one visible light image per subject. In the experiments, we compare the proposed method with a strong baseline and the state of the art to demonstrate its effectiveness.

Fine-Tune Baseline. As mentioned in Subsect. 3.2, we argue that, among fine-tuning variants, fine-tuning only the last fully connected layer achieves the best and most convincing performance. Figure 4 shows that when the margin is small, the model performance can even degrade compared with the pre-trained model. Compared with large databases, we therefore need to set a larger margin.

Modality Agnostic Transfer. The network architecture is illustrated in Fig. 2. We compare two settings: modality agnostic transfer, in which the VIS features are also transferred by the T-stream and FFUs, and vanilla transfer, in which the VIS features are extracted only by the S-stream. Table 1 shows that modality agnostic transfer, i.e., indiscriminately transferring NIR and VIS features into a unique space, significantly improves cross-domain recognition performance.

Table 1. Vanilla transfer vs. modality agnostic transfer.
Fig. 4.

Fine-tuning rank-1 accuracy under various margins.

Fig. 5.

ROC curves on CASIA. Legend entries follow the format {pre-train dataset}-{method}. (Color figure online)

4.3 Evaluation

CASIA Protocol. We present the ROC curves (Fig. 5) and rank-1 identification performance (Table 2) of the fine-tune baseline and PST with different pre-training databases. Figure 5 shows that the TAR of PST is far superior to that of the fine-tune baseline at every FAR (yellow vs. blue and purple vs. red). The rank-1 error rates of our algorithm are reduced from 3.43% to 0.66% and from 0.98% to 0.33%; compared with the fine-tune baseline, PST reduces the rank-1 error rate to roughly 0.2–0.3 of its original value.

Table 2. Performance on CASIA. Pre-trained with WebFace (top) & Pre-trained with VggFace2 (bottom).
Table 3. Comparisons with recent methods. Ours denotes PST with the S-stream pre-trained on VggFace2.

We also compare with several recent heterogeneous face recognition methods, whose results are taken from DVR [21]. Compared with the state of the art, our method reduces the rank-1 error rate to about 11% of its previous value, dropping from 3% to 0.33%. As shown in Table 3, we achieve the best performance.

NIR-to-VIS MegaFace Scenario. As shown in Tables 2 and 3, the improvement in rank-1 accuracy appears limited, largely because the gallery set of the original CASIA evaluation is too small and performance is approaching saturation. Therefore, we add MegaFace's distractors to CASIA's gallery set to increase the difficulty of the NIR-to-VIS face recognition task, as described in Subsect. 3.4.

Fig. 6.

Identification performance on NIR-to-VIS MegaFace. Legend entries follow the format {pre-train dataset}-{method}.

Table 4. Identification performance under different protocols. Models are named in the format {pre-train dataset}-{method}.

We use MTCNN to detect and align the face images of the MegaFace distractors and follow the MegaFace evaluation to measure rank-n accuracy under different numbers of distractors (Fig. 6). Notably, models with similar performance under the original CASIA protocol show a more significant performance gap in the MegaFace scenario (Table 4). The rank-1 accuracy is only slightly improved (from 99.902% to 99.967%) under the original CASIA protocol but greatly improved (from 97.763% to 99.183%) in the MegaFace scenario. The MegaFace scenario therefore better reveals the performance gaps between models and confirms the superiority of our algorithm.
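For reference, the following sketch computes rank-1 identification accuracy after appending the MegaFace distractor embeddings to the CASIA gallery; it assumes L2-normalized embeddings and integer identity labels, and all names are illustrative.

```python
import torch

def rank1_accuracy(probe_emb, probe_ids, gallery_emb, gallery_ids, distractor_emb):
    """Rank-1 identification with MegaFace distractors appended to the CASIA gallery.
    Embeddings are assumed L2-normalized, so the dot product equals cosine similarity."""
    # Distractors get identity labels that never match any probe.
    distractor_ids = torch.full((distractor_emb.size(0),), -1, dtype=gallery_ids.dtype)
    full_gallery = torch.cat([gallery_emb, distractor_emb], dim=0)
    full_ids = torch.cat([gallery_ids, distractor_ids], dim=0)

    similarity = probe_emb @ full_gallery.t()       # (num_probe, num_gallery + num_distractors)
    top1_ids = full_ids[similarity.argmax(dim=1)]   # identity of the closest gallery entry
    return (top1_ids == probe_ids).float().mean().item()
```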

5 Conclusion

In this paper, we propose an NIR-to-VIS face recognition architecture consisting of two parallel network streams, i.e., a source stream (S-stream) and a transfer stream (T-stream). The S-stream is pre-trained on a large-scale VIS database, and its parameters are fixed. Utilizing the multi-scale intermediate feature maps extracted by the S-stream, the T-stream transfers both the NIR and VIS features to a new, unique, modality agnostic space. Transferring only the NIR features into the VIS feature space is not sufficient; modality agnostic transfer is essential. Finally, we validate the effectiveness of our approach in the NIR-to-VIS MegaFace scenario.