Abstract
This paper considers a heterogeneous face recognition problem, i.e., matching near-infrared (NIR) to visible (VIS) face images. The significant domain gap between the NIR and VIS modalities poses great challenges to accurate face recognition. To overcome the domain gap problem, previous works usually adopted a series structure to transfer high-level features. This paper proposes a Parallel-Structure-based Transfer learning method (PST), which fully utilizes multi-scale feature map information. Specifically, PST consists of two parallel streams of network, i.e., a source stream (S-stream) and a transfer stream (T-stream). S-stream is pre-trained on a large-scale VIS database, and its parameters are fixed. It preserves the discriminative ability learned from the large-scale source dataset. T-stream absorbs multi-scale feature maps from S-stream and transfers the NIR and VIS face embeddings to a unique feature space, which is agnostic to the input image modality. The proposed PST method achieves state-of-the-art performance on CASIA NIR-VIS 2.0 Database, the largest near-infrared face database.
This work was supported by the State Key Development Program of the 13th Five-Year Plan under Grant No. 2016YFB0801301 and the National Natural Science Foundation of China under Grant Nos. 61701277 and 61771288.
1 Introduction
In recent years, the performance of ordinary face recognition algorithms has nearly saturated. However, extreme lighting conditions, such as dark environments, severely limit the application of face recognition in daily life. Near-infrared face images maintain high image quality regardless of lighting conditions, making face recognition possible in dark environments. Matching near-infrared face images to visible light face images, i.e., heterogeneous face recognition, is therefore important in realistic face recognition systems [7, 16]. Compared with canonical face recognition, it faces the following challenges:
Insufficient Data. Matching near-infrared and visible light faces requires each subject to have both VIS and NIR images. Collecting such data is difficult, and the existing NIR-VIS face databases [12, 14] are small compared with ordinary face databases, which consist of visible light face images collected by web crawlers.
Cross-Domain Recognition. Cross-domain recognition is the fundamental difficulty of heterogeneous face recognition [12, 14]. Images from different domains tend to vary greatly in appearance, attributes, and distribution. To match images from diverse domains, the algorithm must ignore domain-specific attributes and extract domain-independent features.
Considering the above challenges, we argue that transfer learning and cross-domain matching are the two key issues. First, we believe that a parallel structure using multi-scale feature maps for transfer learning is the better choice. Since deep learning is data-hungry, training a deep model on insufficient samples is prone to overfitting. The common practice is therefore to pre-train the model on a large-scale VIS database (source domain) and then fine-tune the last few layers on a small-scale NIR database (target domain) [9, 10]. This is a series structure that transfers only high-level features. To facilitate comparison with our parallel structure, we refer to the frozen low-level layers in the series structure as the S-stream and the tunable high-level layers as the T-stream. In the series structure, the T-stream transfers the face embeddings using only the high-level features produced by the S-stream, which discards a considerable amount of discriminative information and is not competent enough. In addition, it is difficult to determine how many layers should be fine-tuned. To resolve this compromise, we propose a parallel two-stream architecture: the S-stream is pre-trained on a large-scale VIS database to generate multi-scale feature maps, and the T-stream, which contains fewer parameters to avoid overfitting, transfers them. The comparison of the series and parallel structures is illustrated in Fig. 1.
Second, we believe that both NIR and VIS features should be transferred into a new, unique, modality agnostic feature space. In general transfer learning tasks, we tend to focus on model performance only in the target domain, ignoring the source domain. For NIR-to-VIS cross-domain face recognition, we must attend to both domains and transfer the discriminative features of each into a unique space that eliminates the effects of modality. Using mixed NIR-VIS images, i.e., each subject has both NIR and VIS images, the classification loss automatically optimizes the features of both domains into the modality agnostic feature space.
The main contributions of our work are summarized as follows:
- We propose a parallel architecture consisting of two network streams to achieve cross-domain identification. Using multi-scale feature maps from the S-stream, the T-stream transfers face embeddings of different domains into a modality agnostic feature space.
- We optimize the loss selection in heterogeneous face recognition, using a margin-based loss to help the model learn more expressive representations.
- The CASIA NIR-VIS 2.0 Face database [8] and MegaFace [6] are combined to form a more challenging NIR-to-VIS MegaFace scenario.
2 Related Work
Face Recognition. FaceNet [15] proposes the triplet loss, which minimizes the distance between an anchor and a positive while maximizing the distance between the anchor and a negative; it is an effective metric learning method. SphereFace [11], NormFace [19], CosFace [18, 20], and ArcFace [2] improve the softmax loss used for classification to accommodate metric learning, reaching the current state of the art in face recognition. Our approach also benefits from this improvement.
Heterogeneous Face Recognition. IDR [4] divides the high-level layer into two orthogonal subspaces containing modality-invariant identity information and modality-variant spectrum information, respectively. W-CNN [5] minimizes the Wasserstein distance between the NIR and VIS distributions to obtain an invariant deep feature representation of heterogeneous face images. DVR [21] uses a Disentangled Variational Representation for cross-modal matching. All of these pre-train the model on large databases and keep the parameters of the low-level layers fixed, i.e., they are series-structure-based transfer learning methods.
PTU [24] argues that parameter transferability differs across domains and networks and should not be simply categorized as random, fine-tuned, or frozen; it proposes a parameter transfer unit to tackle these limitations. In our method, feature fusion units connect the S-stream, which preserves the discriminative ability learned on large databases, and the T-stream, which performs the transfer; this is similar in spirit to PTU. Moreover, the feature fusion units exploit the multi-scale feature maps from the S-stream to help the T-stream learn expressive representations.
3 Method
In this section, we first present the details of the proposed architecture (Sect. 3.1) and explain how we transfer both NIR and VIS features to a unique space (Sect. 3.2). Subsequently, we briefly introduce the margin-based loss and explain its importance for representation learning in heterogeneous face recognition (Sect. 3.3). Finally, we introduce the more challenging NIR-to-VIS MegaFace scenario (Sect. 3.4).
3.1 Architecture
The overall structure of PST is illustrated in Fig. 2. It is a parallel two-stream architecture consisting of a source stream (S-stream), a transfer stream (T-stream), and feature fusion units (FFUs). The S-stream adopts a ResNet-like network structure; the T-stream is a lite version of the S-stream. Convn.x (where \(\mathrm{n}=1,2,3,4\)) in Fig. 2 denotes units that contain several convolution layers, activation layers, and residual units. FFUs absorb the multi-scale intermediate feature maps of the S-stream and fuse them with the corresponding feature maps of the T-stream. Given the structure of ResNet [3], which produces feature maps at 4 scales, we use 4 feature fusion units to connect the S-stream and the T-stream so that feature maps of different resolutions can be transferred. A feature fusion unit consists of several convolution layers and activation layers, which is formulated by:

\(f = H(Concat(f_s, f_t))\)

where \(f_s\) and \(f_t\) are the corresponding intermediate feature maps of the S-stream and T-stream, respectively, and \(f\) is the result of the feature transfer; these three tensors have exactly the same shape. \(Concat(\cdot )\) denotes channel-wise concatenation, and \(H(\cdot )\) denotes a residual unit composed of two depth-wise separable convolution and activation layers.
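A feature fusion unit of this form might be sketched in PyTorch as follows. The channel sizes, the 1×1 projection shortcut, and the activation choices are our assumptions; the paper specifies only the overall structure (concatenation followed by a residual unit of two depth-wise separable convolutions):

```python
import torch
import torch.nn as nn

class FeatureFusionUnit(nn.Module):
    """Sketch of an FFU: f = H(Concat(f_s, f_t)), with H a residual unit
    built from two depth-wise separable convolutions and activations."""

    def __init__(self, channels):
        super().__init__()
        c2 = 2 * channels  # channel count after concatenating f_s and f_t
        # 1x1 projection so the skip path matches the output width
        # (an assumption; the paper does not detail the shortcut).
        self.shortcut = nn.Conv2d(c2, channels, kernel_size=1)
        self.branch = nn.Sequential(
            # depth-wise separable conv #1: depth-wise 3x3 then point-wise 1x1
            nn.Conv2d(c2, c2, kernel_size=3, padding=1, groups=c2),
            nn.Conv2d(c2, channels, kernel_size=1),
            nn.PReLU(channels),
            # depth-wise separable conv #2
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.PReLU(channels),
        )

    def forward(self, f_s, f_t):
        # f_s, f_t: same-shape feature maps from the S-stream and T-stream
        x = torch.cat([f_s, f_t], dim=1)          # Concat(.)
        return self.shortcut(x) + self.branch(x)  # residual unit H(.)
```

Because the output keeps the shape of \(f_t\), the fused map can replace the T-stream feature at the same scale without altering the rest of the network.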
After S-stream is pre-trained with a large-scale VIS database and its parameters are fixed, we use the original NIR or VIS image and the corresponding multi-scale intermediate feature maps extracted by S-stream as input to the T-stream and FFUs. While T-stream generates feature maps of an NIR or VIS image, it absorbs the corresponding feature maps generated by S-stream through FFUs to form transferred feature maps. The top-level post-transfer features form face embeddings, which are optimized by margin-based classification loss.
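The two-stream forward pass described above might be assembled as follows. This is an illustrative sketch (class and argument names are our assumptions, not the paper's code): a frozen S-stream yields a feature map at each scale, and the T-stream consumes the image plus those maps through the four fusion units before the final embedding layer.

```python
import torch
import torch.nn as nn

class ParallelTransfer(nn.Module):
    """Sketch of PST's forward pass: frozen S-stream, trainable T-stream,
    with per-scale fusion units connecting the two streams."""

    def __init__(self, s_blocks, t_blocks, ffus, embed):
        super().__init__()
        self.s_blocks = nn.ModuleList(s_blocks)  # pre-trained on VIS data
        self.t_blocks = nn.ModuleList(t_blocks)  # lite transfer stream
        self.ffus = nn.ModuleList(ffus)          # one fusion unit per scale
        self.embed = embed                       # maps to a 512-d embedding
        for p in self.s_blocks.parameters():
            p.requires_grad = False              # S-stream stays fixed

    def forward(self, x):
        f_s, f_t = x, x
        for s_blk, t_blk, ffu in zip(self.s_blocks, self.t_blocks, self.ffus):
            with torch.no_grad():
                f_s = s_blk(f_s)                 # frozen source feature map
            f_t = ffu(f_s, t_blk(f_t))           # fuse at this scale
        return self.embed(f_t)                   # transferred embedding
```

Only `t_blocks`, `ffus`, and `embed` receive gradients, matching the two-step training strategy in which the S-stream parameters are fixed after pre-training.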
3.2 Modality Agnostic Transfer
Since the S-stream in our model is pre-trained on large-scale VIS data and its parameters remain fixed, the model can extract discriminative VIS-domain features using the S-stream alone. The vanilla transfer approach (left of Fig. 3) keeps the VIS feature space unchanged (using features extracted only by the S-stream) and transfers NIR features into it. However, forcing only the NIR features to be transferred into the VIS feature space causes the model to lose focus on the VIS features. The key point of cross-domain identification is that we must attend to both the source and target domains, without placing emphasis on either side. We argue that the model should perform "modality agnostic transfer" on both NIR and VIS data (right of Fig. 3): the features of both modalities are transferred indiscriminately into a new, modality agnostic, unique feature space. This ensures that the model balances the source and target domains. In the new space, features of the same subject from different domains are optimized to be closer. Our empirical study further supports this.
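A minimal sketch of the mixed-batch idea driving this transfer (the helper and variable names are our assumptions): each sampled identity contributes one NIR and one VIS image under the same label, so a single shared classification loss pulls both modalities toward one modality agnostic space.

```python
import random

def mixed_batch(nir, vis, ids, k):
    """Sample a mixed NIR-VIS batch: `nir` and `vis` map a subject id to
    its image lists; each of the k sampled subjects contributes one NIR
    and one VIS image carrying the SAME identity label."""
    batch = []
    for pid in random.sample(ids, k):
        batch.append((random.choice(nir[pid]), pid))  # NIR sample
        batch.append((random.choice(vis[pid]), pid))  # VIS sample
    return batch
```

Because the label is identical across modalities, the classifier has no way to separate NIR from VIS features of the same subject; minimizing the loss therefore collapses both into one space.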
3.3 Margin-Based Loss
Recently, many works [2, 11, 18,19,20] have improved the cross entropy loss function to accommodate metric learning, which significantly enhances the performance of face recognition. Cross entropy loss is formulated as follows:

\(L = -\log \frac{e^{score(\theta _y)}}{\sum _{i=1}^{n} e^{score(\theta _i)}}\)

where \(score(\theta _y)\) denotes the category score of the y-th (ground-truth) class and n is the number of classes in the dataset. The improvements mainly introduce a margin m into the category score \(score(\theta _y)\), in three main forms:

\(score(\theta _y) = \Vert x\Vert \cos (m\theta _y)\), \(\quad score(\theta _y) = s(\cos \theta _y - m)\), \(\quad score(\theta _y) = s\cos (\theta _y + m)\)

They are proposed in SphereFace [11], CosFace [20], and ArcFace [2], respectively, where \(\Vert x\Vert \) is the feature norm and s is a fixed scale factor.
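The three margin forms can be written down directly. The helper below is a sketch with illustrative names (theta in radians, `s` a scale factor), not the paper's implementation; it shows that each variant lowers the target-class score, forcing the features to be pulled tighter around their class center:

```python
import numpy as np

def margin_logit(theta_y, m, s, kind):
    """Apply one of the three margin formulations to the target-class
    angle theta_y (radians), scaled by s."""
    if kind == "sphereface":      # multiplicative angular margin
        return s * np.cos(m * theta_y)
    if kind == "cosface":         # additive cosine margin
        return s * (np.cos(theta_y) - m)
    if kind == "arcface":         # additive angular margin
        return s * np.cos(theta_y + m)
    raise ValueError(kind)

def cross_entropy(logits, y):
    """Numerically stable softmax cross entropy over class logits."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y])
```

Since the margin shrinks the target logit while leaving the other classes untouched, the loss stays high until the target angle itself becomes small, which is what keeps the per-class feature distributions compact.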
Improvements in loss design have been thoroughly validated on large databases, but few works have applied them to heterogeneous face recognition. Large-scale face databases contain very rich sample categories, whereas the CASIA NIR-VIS 2.0 database [8] provides fewer than 400 subjects as the training set. The margin-based losses described above effectively keep the feature distribution compact. Small databases with few categories often require larger margins than large databases (Subsect. 4.2).
3.4 NIR-to-VIS MegaFace Scenario
Both the CASIA NIR-VIS 2.0 database and MegaFace use evaluation protocols for face identification [6, 8]. CASIA evaluates how well an algorithm matches near-infrared to visible images: the probe set consists of near-infrared images, and the gallery set contains visible images. The probe set of the MegaFace evaluation is even smaller than CASIA's, but its gallery set contains as many as a million distractors, making the task much more challenging. We add MegaFace's distractors to CASIA's gallery set to increase the difficulty of the NIR-to-VIS task. Specific evaluation details are given in Subsect. 4.3.
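The merged evaluation reduces to a rank-1 computation over the enlarged gallery. A minimal sketch (function names are ours; we assume L2-normalized cosine similarity, as is standard for such embeddings): distractor rows simply extend the gallery with labels that match no probe.

```python
import numpy as np

def rank1_accuracy(probe, probe_ids, gallery, gallery_ids):
    """Rank-1 identification: for each probe embedding, find the most
    similar gallery embedding (cosine similarity on L2-normalized
    vectors) and check whether its identity label matches."""
    p = probe / np.linalg.norm(probe, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    best = (p @ g.T).argmax(axis=1)          # index of nearest gallery entry
    return (gallery_ids[best] == probe_ids).mean()
```

Appending a million distractor embeddings (with, say, label -1) to `gallery` and `gallery_ids` turns the small CASIA gallery into the harder merged scenario without changing the metric.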
4 Experiments
Our approach adopts a two-step training strategy. We first pre-train the S-stream on a large-scale VIS database, keeping its parameters fixed. Then, we train the T-stream and FFUs with mixed NIR-VIS images and the corresponding multi-scale feature maps extracted by the S-stream. All experiments are based on PyTorch [13].
4.1 Pre-train S-stream with Large-Scale VIS Data
As mentioned in Sect. 1, pre-training the model is a common and essential step. Our implementation is similar to SphereFace [11], CosFace [20], and ArcFace [2]. We select a ResNet50-like backbone as in ArcFace; the network outputs a 512-dimensional embedding. CASIA-WebFace [23] and VGGFace2 [1] are used to pre-train the S-stream.
4.2 Train T-stream and FFUs with Mixed NIR-VIS Data
To the best of our knowledge, the CASIA NIR-VIS 2.0 Face Database [8] is the largest public near-infrared face database and the only accessible dataset designed for NIR-VIS face recognition. All of our experiments are conducted on this database. It contains 725 subjects in total, each with 1–22 visible light images and 5–50 near-infrared images. We follow the 10-fold protocol in View 2 of the database. Each training fold contains approximately 8,600 NIR or VIS images from around 360 subjects. Each testing fold consists of a probe set and a gallery set: the probe set contains over 6,100 NIR images of the remaining 358 subjects, and the gallery set contains only one visible light image per subject. In the experiments, we compare the proposed method with a strong baseline and the state of the art to demonstrate its effectiveness.
Fine-Tune Baseline. Among fine-tune variants, tuning only the last fully connected layer achieves the best and most convincing performance, so we adopt it as the baseline. Figure 4 shows that when the margin is small, model performance can even degrade relative to the pre-trained model. Compared to large databases, we therefore need to set a larger margin.
Modality Agnostic Transfer. The architecture of the network is illustrated in Fig. 2. In the modality agnostic transfer setting, the VIS features are transferred by the T-stream and FFUs; in the vanilla transfer setting, the VIS features are extracted only by the S-stream. Table 1 shows that performing modality agnostic transfer, i.e., indiscriminately transferring both NIR and VIS features into a unique space, significantly improves cross-domain recognition performance.
4.3 Evaluation
CASIA Protocol. We present the ROC curves (Fig. 5) and rank-1 identification performance (Table 2) of the fine-tune baseline and PST with different pre-training databases. Figure 5 shows that the TAR of PST is far superior to that of the fine-tune baseline at any FAR (yellow vs. blue and purple vs. red). The rank-1 error rates of our algorithm drop from 3.43% to 0.66% and from 0.98% to 0.33%; that is, PST reduces the rank-1 error rate to roughly 0.2–0.3 times that of the fine-tune baseline.
We also compare with recent heterogeneous face recognition methods; these results are taken from DVR [21]. Compared with the state of the art, our method reduces the rank-1 error rate to about 11% of the previous best, dropping from 3% to 0.33%. As shown in Table 3, we achieve the best performance.
NIR-to-VIS MegaFace Scenario. As shown in Tables 2 and 3, the improvement in rank-1 accuracy appears limited, largely because the gallery set under the original CASIA evaluation is too small and performance is approaching saturation. We therefore add MegaFace's distractors to CASIA's gallery set to increase the difficulty of the NIR-to-VIS face recognition task, as described in Subsect. 3.4.
We use MTCNN to detect and align the face images in the MegaFace distractors and follow the MegaFace evaluation to measure rank-n results under different numbers of distractors (Fig. 6). Notably, models with similar performance under the original CASIA protocol show a much larger performance gap in the MegaFace scenario (Table 4): rank-1 accuracy improves only slightly (from 99.902% to 99.967%) under the original CASIA protocol but substantially (from 97.763% to 99.183%) in the MegaFace scenario. The MegaFace scenario thus better reflects performance gaps between models and confirms the superiority of our algorithm.
5 Conclusion
In this paper, we propose a NIR-to-VIS face recognition architecture consisting of two parallel network streams, i.e., a source stream (S-stream) and a transfer stream (T-stream). The S-stream is pre-trained on a large-scale VIS database, and its parameters are fixed. Utilizing the multi-scale intermediate feature maps extracted by the S-stream, the T-stream transfers both the NIR and VIS features into a new, unique, modality agnostic space. Transferring only the NIR features into the VIS feature space is not sufficient; modality agnostic transfer is essential. Finally, we validate the effectiveness of our approach in the NIR-to-VIS MegaFace scenario.
References
Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: a dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018), pp. 67–74. IEEE (2018)
Deng, J., Guo, J., Zafeiriou, S.: ArcFace: additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
He, R., Wu, X., Sun, Z., Tan, T.: Learning invariant deep representation for NIR-VIS face recognition. In: AAAI Conference on Artificial Intelligence (2017)
He, R., Wu, X., Sun, Z., Tan, T.: Wasserstein CNN: learning invariant features for NIR-VIS face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1761–1773 (2018)
Kemelmacher-Shlizerman, I., Seitz, S.M., Miller, D., Brossard, E.: The MegaFace benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882 (2016)
Klare, B.F., Jain, A.K.: Heterogeneous face recognition using kernel prototype similarities. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1410–1422 (2013)
Li, S., Yi, D., Lei, Z., Liao, S.: The CASIA NIR-VIS 2.0 face database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 348–353 (2013)
Li, Y., Wang, S., Tian, Q., Ding, X.: Feature representation for statistical-learning-based object detection: a review. Pattern Recogn. 48(11), 3542–3559 (2015)
Li, Y., Wang, S., Tian, Q., Ding, X.: A survey of recent advances in visual feature detection. Neurocomputing 149, 736–751 (2015)
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: SphereFace: deep hypersphere embedding for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Liu, X., Song, L., Wu, X., Tan, T.: Transferring deep representation for NIR-VIS heterogeneous face recognition. In: 2016 International Conference on Biometrics (ICB), pp. 1–8. IEEE (2016)
Paszke, A., et al.: Automatic differentiation in PyTorch (2017)
Reale, C., Nasrabadi, N.M., Kwon, H., Chellappa, R.: Seeing the forest from the trees: a holistic approach to near-infrared heterogeneous face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 54–62 (2016)
Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
Socolinsky, D.A., Selinger, A., Neuheisel, J.D.: Face recognition with visible and thermal infrared imagery. Comput. Vis. Image Underst. 91(1–2), 72–114 (2003)
Song, L., Zhang, M., Wu, X., He, R.: Adversarial discriminative heterogeneous face recognition. arXiv preprint arXiv:1709.03675 (2017)
Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Signal Process. Lett. 25(7), 926–930 (2018)
Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: NormFace: L2 hypersphere embedding for face verification. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 1041–1049. ACM (2017)
Wang, H., Wang, Y., Zhou, Z., Ji, X., Liu, W.: CosFace: large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Wu, X., Huang, H., Patel, V.M., He, R., Sun, Z.: Disentangled variational representation for heterogeneous face recognition. arXiv preprint arXiv:1809.01936 (2018)
Wu, X., Song, L., He, R., Tan, T.: Coupled deep learning for heterogeneous face recognition. arXiv preprint arXiv:1704.02450 (2017)
Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. arXiv preprint arXiv:1411.7923 (2014)
Zhang, Y., Zhang, Y., Yang, Q.: Parameter transfer unit for deep neural networks. arXiv preprint arXiv:1804.08613 (2018)
Wang, Y., Li, Y., Wang, S. (2019). Parallel-Structure-based Transfer Learning for Deep NIR-to-VIS Face Recognition. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science(), vol 11901. Springer, Cham. https://doi.org/10.1007/978-3-030-34120-6_12