
1 Introduction

As one of the most important biometric technologies, face recognition plays a key role in many application scenarios, such as device unlocking, application login, and mobile payment. In the past decades, many face recognition algorithms have been reported and great progress has been made. In recent years, advances in deep learning have greatly boosted the performance of face recognition. One of the most important research topics is the extraction of more discriminative facial features with convolutional neural networks, which can be discussed from three aspects. Firstly, more powerful network architectures are adopted by introducing deeper or wider networks, from VGG [1] to ResNet [2]. Secondly, large and refined face datasets are constructed to simulate real-world scenarios; for example, the larger and wilder MS1M [4] dataset was constructed to replace CASIA-WebFace [3]. Finally, more rigorous loss functions are designed to enable the network to learn more discriminative face features.

In fact, most commercial face recognition systems require users to actively cooperate with the camera so as to acquire clear face images; that is, they only work well in constrained environments. However, in tougher application scenarios, it is almost impossible to meet these conditions. For example, face recognition systems used for public safety can hardly acquire a clear and complete face image. In this scenario, law enforcement agencies frequently need to compare ID document photos with spot face images: the face verification system must help the police identify criminals from spot images based on their ID document photos. This kind of face verification system is quite different from existing commercial face recognition systems. First, the images are generally acquired in more natural environments, in which factors such as the capturing view and lighting conditions are uncontrolled. Second, criminals tend to hide their faces, which further increases the difficulty of verification. Finally, the ID document photos are generally normative but low-quality, while the images from surveillance or spot cameras are generally high-quality but arbitrary. That is, the two types of faces are heterogeneous, which makes matching them more difficult. Figure 1 illustrates the difference between ID faces and spot faces. In this paper, we focus mainly on the ID-Spot face verification problem.

Fig. 1.

The sample photos show three situations. Each column is a matching pair of images; from left to right: normal, sunglasses, and mask. Note that the black rectangular blocks are added for privacy protection and are not part of the information the images carry.

In the ID-Spot face verification scenario, the ID document photo is standardized and clear, while the spot photo may be covered by items such as sunglasses and masks. Due to this special nature, standard face feature extraction methods cannot produce a sufficiently effective and discriminative face representation. Therefore, we employ a pseudo-siamese network to address the heterogeneity: the network for processing ID document photos is different from the one handling spot photos. That is, the two networks have the same architecture and are trained jointly, but they do not share parameters. In this manner, we can effectively enhance the discriminative ability of heterogeneous face representations. In addition, a global weight pooling method is proposed to suppress the negative effect of background and occlusion. With this method, the visible face regions are assigned larger weights than the background and occlusions, which makes the face representation more discriminative.

In brief, we aim to build a fast and effective face verification system for wild conditions. To achieve this goal, we propose a face representation method that achieves good performance. The main contributions of this article are as follows:

  1.

    We explored the face verification problem with partial occlusion and quantitatively analyzed the impact of occlusion on face verification.

  2.

    We adjusted the global average pooling in the CNN and achieved a performance improvement: at FAR = 0.01%, we increased the TAR from 47.58% to 57.63%.

  3.

    Our model achieved the best results on a Chinese ID-Spot dataset.

2 Related Works

2.1 Face Recognition Based on Deep Learning

Due to the emergence of massive data and the tremendous increase in computing power, deep learning has shown great vitality in the field of computer vision. Face recognition can be regarded as a special type of image classification task.

Using softmax to classify faces is the most basic method for face recognition. Since softmax is designed only for classification, it is weak at enlarging the distance between classes and reducing the distance within a class. Therefore, a series of methods such as center loss [5], SphereFace [6], CosineFace [7], and ArcFace [8] have appeared. Center loss [5] adds an additional supervisory signal to compress the intra-class distance. To enhance the softmax loss, the multiplicative margin and the additive margin are introduced into the angle space by SphereFace [6] and ArcFace [8], respectively. CosineFace [7] and AM-Softmax [9] add an additive margin in the cosine space to increase the penalty and obtain more discriminative facial representations. Softmax-based methods classify the faces of the training set globally. Their advantage is fast convergence; their disadvantage is that when the number of classes in the training set is large, more memory is needed.

DeepID [10, 11] combines softmax and verification signals to train the network. FaceNet [12] uses the triplet loss to learn facial representations on large-scale databases. Contrastive loss and triplet loss are both pair-based strategies, so data pairs must be built before training, and the way the pairs are built has a big impact on the results. Hard samples are often chosen to form triplets. Such methods search for optimal solutions in a local space and often require longer training time on large-scale training data.

2.2 ID Versus Spot

ID-Spot verification can be considered a special case of heterogeneous face verification. Although the two types of images share the same structure, their data distributions differ by a gap that is hard to cross. There are usually two kinds of methods for solving heterogeneous problems: one first converts the images so that the two types of data are similarly distributed, and the other maps the two types of images into the same shared feature space. Many researchers [13,14,15,16] have conducted extensive experiments and explorations on this issue.

Large-scale [17] and DocFace [18, 19] explored the ID-Spot verification problem. The two works adopted broadly similar strategies. The first step is to pre-train on open large-scale datasets to obtain a model that is sensitive to human faces. The model is then fine-tuned on the ID-Spot dataset (i.e., transfer learning) so that it can better handle heterogeneous ID photos and spot photos. More specifically, Large-scale adopted a classification-verification-classification strategy to gradually improve the performance of the model, while DocFace designed an optimization method called DWI to update the weights.

2.3 Face Verification with Partial Occlusion

In the development of face recognition, many researchers have explored and experimented with occlusion problems. Subspace regression methods handle occluded face recognition by regressing the unoccluded face image and the occlusion into their respective subspaces; robust error coding attempts to separate an occluded image into occluded and unoccluded regions; robust feature extraction methods decompose the image features to reduce mutual interference between them and provide sufficiently fine features for subsequent recognition.

In recent years, there have been many works on partially occluded faces. Robust LSTM [20] proposes a robust long short-term memory autoencoder model to restore occluded faces. DFI [21] recognizes a face by forming a facial star-network graph that connects key points of the face region. Enhancing [22] improves the recognition rate by finding areas that have a significant impact on recognition.

3 Approach

In this section, we describe our approach in detail. The proposed method improves face verification performance under partial occlusion.

3.1 The Impact of Occlusion on Face Verification

It is well known that partial occlusion such as sunglasses causes trouble for face verification. We conduct a quantitative analysis of the effect of sunglasses on face verification and report it numerically. We first compare matching similarities between images and obtain the cosine similarities of spot-spot, glasses-glasses, and spot-glasses pairs; all three groups consist of pairs from different identities. Second, we use the same model to evaluate the dataset with occlusion and the dataset without occlusion and compare the verification results. In this process, we use a MobileFaceNet [23] model trained with ArcFace [8].

3.2 Network Architecture

Global Weight Pooling (GWP).

In practice, face verification should achieve an ideal balance between speed and accuracy. MobileFaceNet [23] is a lightweight network designed for face recognition that can be deployed on mobile devices. MobileFaceNet draws on the MobileNetV1 [24], MobileNetV2 [25], and ShuffleNet [26] networks, which use many separable convolutions to reduce computation and parameters. At the end of the convolutional stem, MobileFaceNet uses a global separable convolution instead of global average pooling for downsampling.

We use the backbone network of MobileFaceNet, so the final feature map is of size 7×7 with 512 channels. In place of the global average pooling layer, we use a global pooling operation with weights. The weights on each channel are not shared, and the weight parameter is of size 512×7×7. After each training iteration, the 49 weights on each channel are processed to ensure that the result of each inference is a weighted average. See Fig. 2.

Fig. 2.

A 512×7×7 feature map is obtained after the image passes through the CNN stem. The CNN stem consists of four parts: the first part contains two convolution modules, followed sequentially by 4, 6, and 2 residual blocks, which contain many separable convolution operations. Usually, a global average pooling operation follows the convolution layers; instead, we use GWP, which constrains the sum of the weights of each channel to 1, in place of GAP (global average pooling). The physical meaning of GWP is to perform a weighted average operation, as its name suggests. An embedding layer consisting of BN-FC-BN is applied at the end.

The global average pooling layer is calculated as:

$$ Output_{GAP - c} = \frac{1}{W \times H}\sum\nolimits_{i,j} {F_{i,j,c} } $$
(1)

The global separable convolution is calculated as:

$$ Output_{GSC - c} = \sum\nolimits_{i,j} {W_{i,j,c} \cdot F_{i,j,c} } $$
(2)

The global weighted average pooling layer is calculated as:

$$ Output_{GWP - c} = \sum\nolimits_{i,j} {W_{i,j,c} \cdot F_{i,j,c} } $$
$$ \sum\nolimits_{i,j} {W_{i,j,c} } = 1 $$
(3)

where F is the input feature map of size W × H × C (W and H are the spatial width and height and C is the number of channels), W_{i,j,c} is the pooling weight at spatial position (i, j) of channel c, and the weight matrix has the same size W × H × C as F.
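To make the relationship between Eqs. (1)-(3) concrete, the following minimal NumPy sketch (our own illustration, not the paper's code) computes the GAP, GSC, and GWP outputs for a single feature map; the softmax normalization used here for GWP is just one of the weight-processing options discussed below.

```python
import numpy as np

# Toy feature map F of size H x W x C (7 x 7 x 512 in the paper).
H, W, C = 7, 7, 512
F = np.random.rand(H, W, C).astype(np.float32)

# Eq. (1): global average pooling, one value per channel.
gap = F.mean(axis=(0, 1))                       # shape (C,)

# Eq. (2): global separable convolution, unconstrained per-position weights.
W_gsc = np.random.randn(H, W, C).astype(np.float32)
gsc = (W_gsc * F).sum(axis=(0, 1))              # shape (C,)

# Eq. (3): global weighted pooling, same form as GSC but the weights of each
# channel are constrained to sum to 1, so the output is a weighted average.
W_raw = np.random.randn(H, W, C).astype(np.float32)
W_gwp = np.exp(W_raw) / np.exp(W_raw).sum(axis=(0, 1), keepdims=True)  # softmax over H*W
gwp = (W_gwp * F).sum(axis=(0, 1))              # shape (C,)

assert np.allclose(W_gwp.sum(axis=(0, 1)), 1.0, atol=1e-5)
```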

We use three methods to process the weights so that they form a weighted average (a minimal sketch of these options is given after the list). They are:

  • Option-A: apply the softmax function

  • Option-B: apply ReLU, then softmax

  • Option-C: apply ReLU, then rescale the weights to sum to 1
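The sketch below illustrates, under our own interpretation, how the raw 7×7 weight map of one channel could be normalized under each option; all three functions return weights that sum to 1.

```python
import numpy as np

def option_a(w):
    """Option-A: softmax over the spatial positions of one channel."""
    e = np.exp(w - w.max())
    return e / e.sum()

def option_b(w):
    """Option-B: ReLU first, then softmax over the spatial positions."""
    w = np.maximum(w, 0.0)
    e = np.exp(w - w.max())
    return e / e.sum()

def option_c(w, eps=1e-8):
    """Option-C: ReLU first, then rescale so the weights sum to 1."""
    w = np.maximum(w, 0.0)
    return w / (w.sum() + eps)

raw = np.random.randn(7, 7)          # raw weights of a single channel
for fn in (option_a, option_b, option_c):
    assert abs(fn(raw).sum() - 1.0) < 1e-5
```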

The GWP structure increases the weights of the effective face regions in the image, yielding more discriminative facial features; processing the parameter matrix into normalized weights highlights the importance of each face region.

Pseudo-siamese Network.

Most existing face verification networks are based on the siamese network, which is designed to handle inputs with similar distributions. In ID-Spot verification, although both inputs are face images, the ID photo and the spot photo differ considerably in data distribution and show obvious heterogeneous characteristics, so a pseudo-siamese network is better suited to the problem.

We use a pseudo-siamese network to handle the heterogeneity of ID photos and spot photos. Two networks with the same structure process the ID photo and the spot photo separately, and the two networks do not share parameters. In addition, a comparative experiment shows that sharing the embedding layer performs worse than keeping the two embedding layers independent. The embedding layer, shown in Fig. 2, has the structure BN-FC-BN. Therefore, we use two networks with completely independent parameters. Figure 3 shows the pipeline.
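The following Gluon sketch illustrates the pseudo-siamese arrangement; the stem is a small placeholder rather than the actual MobileFaceNet implementation, and the layer sizes are assumptions for illustration. The key point is that the two branches share an architecture but not parameters, and each ends with its own BN-FC-BN embedding layer.

```python
from mxnet import nd
from mxnet.gluon import nn

def make_branch(embedding_size=512):
    """One branch: a stand-in CNN stem followed by a BN-FC-BN embedding layer.
    The stem here is a placeholder; the paper uses the MobileFaceNet stem."""
    net = nn.HybridSequential()
    net.add(
        nn.Conv2D(64, kernel_size=3, strides=2, padding=1, use_bias=False),
        nn.BatchNorm(),
        nn.Activation('relu'),
        nn.GlobalAvgPool2D(),          # placeholder for the GWP / GSC layer
        # Embedding layer: BN-FC-BN, as in Fig. 2.
        nn.BatchNorm(),
        nn.Dense(embedding_size, use_bias=False),
        nn.BatchNorm(),
    )
    return net

# Pseudo-siamese: same architecture, independent parameters.
id_branch = make_branch()
spot_branch = make_branch()
id_branch.initialize()
spot_branch.initialize()

id_emb = id_branch(nd.random.uniform(shape=(1, 3, 112, 112)))
spot_emb = spot_branch(nd.random.uniform(shape=(1, 3, 112, 112)))
```

Because the parameters are created independently for each branch, gradients from ID photos only update the ID branch and gradients from spot photos only update the spot branch.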

Fig. 3.

The pipeline for extracting facial features with the pseudo-siamese network.

4 Experiments

Our code is based on the MXNet framework, and all experiments run on two NVIDIA 1080Ti GPUs (12G). The specific experimental settings are described in detail below.

4.1 Dataset

MS1M.

MS1M [4] is currently the largest open-source face dataset, containing approximately 100K identities and 10 million images. However, the original MS1M contains a lot of noise; ArcFace [8] cleaned it to obtain a refined dataset with approximately 85K identities and 5.8 million images. We use this refined dataset [27] for training.

LFW, AgeDB, CFP-FP.

LFW [28] is a well-known test dataset in the field of face recognition, containing 13,233 images of 5,749 identities collected online. AgeDB [29] is a test set focusing on age variation. CFP-FP [30] emphasizes profile (side-view) faces. We obtain 6,000, 6,000, and 7,000 image pairs from LFW, AgeDB, and CFP-FP, respectively, and use these three datasets as validation sets to select the optimal model.

IDSpot.

The IDSpot dataset is a private dataset that includes 19,500 pairs of matched images and 100K pairs of unmatched images. Each pair consists of an ID photo and a spot photo; matched pairs come from the same identity and unmatched pairs from different identities. The ID photos are uniform in size, degree of blur, and image type, while the spot photos vary greatly in style.

We resize the ID photos to 112×112 with image processing operations such as cropping and resizing. For the spot photos, we use MTCNN [31] to detect the faces, align them according to the 5 facial landmarks, and finally resize them to 112×112.
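As a rough illustration of the spot-photo preprocessing (assuming MTCNN has already returned the 5 landmarks; the reference template coordinates below are a commonly used 112×112 layout, not values reported in the paper), alignment could be done as follows.

```python
import cv2
import numpy as np
from skimage.transform import SimilarityTransform

# Commonly used 5-point reference template for 112x112 crops (assumed, not from the paper):
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
REF_5PTS = np.array([
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def align_face(image, landmarks_5):
    """Warp the detected face to 112x112 using a similarity transform
    estimated from the 5 MTCNN landmarks."""
    tform = SimilarityTransform()
    tform.estimate(np.asarray(landmarks_5, dtype=np.float32), REF_5PTS)
    M = tform.params[0:2, :]                     # 2x3 affine matrix
    return cv2.warpAffine(image, M, (112, 112), borderValue=0.0)
```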

Because it is difficult to collect enough labeled occlusion images, we build a synthetic occlusion dataset, IDSpot-paste, by image processing. We choose 100 sunglasses templates and 10 mask templates as occluders. For each face, we first randomly select one of three cases: no processing, sunglasses, or mask. In the latter two cases, we randomly select a template of the corresponding type to generate an occluded face and paste it onto the relevant face position; since the faces are aligned, the positions of the eyes and mouth corners are known. Figure 4 illustrates some examples of occluded faces generated by this method.
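A simplified sketch of the occlusion synthesis is given below; the template file names, the loader function, and the paste regions are illustrative assumptions, while the paper pastes templates at the known eye and mouth-corner positions of aligned faces.

```python
import random

SUNGLASSES = [f"sunglasses_{i}.png" for i in range(100)]   # 100 templates (paths assumed)
MASKS = [f"mask_{i}.png" for i in range(10)]                # 10 templates (paths assumed)

def occlude(face, load_template, eye_box, mouth_box):
    """Randomly keep the face unchanged, paste sunglasses over the eye region,
    or paste a mask over the mouth region. Boxes are (y0, y1, x0, x1) in the
    aligned 112x112 face, known from the alignment landmarks."""
    choice = random.choice(["none", "sunglasses", "mask"])
    if choice == "none":
        return face
    template_path = random.choice(SUNGLASSES if choice == "sunglasses" else MASKS)
    y0, y1, x0, x1 = eye_box if choice == "sunglasses" else mouth_box
    patch = load_template(template_path, size=(y1 - y0, x1 - x0))  # user-provided loader
    out = face.copy()
    out[y0:y1, x0:x1] = patch          # naive paste; alpha blending could be used instead
    return out
```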

Fig. 4.

Some spot photos and the corresponding generated occlusion photos. Note that the black rectangular blocks are added for privacy protection and are not part of the information the images carry.

We select 15,600 matched pairs for training and use the remaining 3,900 matched pairs together with the 100K unmatched pairs for verification, repeating this process 5 times for 5-fold cross-validation. Note that the identities appearing in the matched pairs do not appear in the unmatched pairs, which ensures that the same identity never appears in both the training and verification sets.

4.2 Occlusion Impact Analysis

Random Matching Experiment.

We randomly select 10,000 images from the spot photos and divide them into two groups, named A and B. Following the occlusion generation method, A and B are processed to obtain A_paste and B_paste, respectively. We use the pretrained MobileFaceNet model to extract features for all four groups. We then compare three groups of similarities: first, the 5,000 cosine similarities between A and B, for which we compute the mean and variance; second, the cosine similarities between A_paste and B_paste, again with mean and variance; and finally the cosine similarities between A and B_paste and between B and A_paste, giving 10,000 similarities, for which we also compute the mean and variance. Table 1 shows the experimental results.
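The comparison reduces to cosine similarities between feature groups. A minimal sketch (feature extraction is omitted and the arrays below are placeholders) could look like this.

```python
import numpy as np

def cosine_sim(x, y):
    """Row-wise cosine similarity between two feature matrices of equal shape."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return (x * y).sum(axis=1)

# feats_A, feats_B, feats_A_paste, feats_B_paste: (5000, 512) arrays extracted
# with the pretrained MobileFaceNet model (random placeholders here).
feats_A, feats_B = np.random.randn(5000, 512), np.random.randn(5000, 512)
feats_A_paste, feats_B_paste = np.random.randn(5000, 512), np.random.randn(5000, 512)

groups = {
    "spot-spot":   cosine_sim(feats_A, feats_B),                              # 5000 scores
    "paste-paste": cosine_sim(feats_A_paste, feats_B_paste),                  # 5000 scores
    "spot-paste":  np.concatenate([cosine_sim(feats_A, feats_B_paste),
                                   cosine_sim(feats_B, feats_A_paste)]),      # 10000 scores
}
for name, s in groups.items():
    print(f"{name}: mean={s.mean():.4f}, var={s.var():.4f}")
```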

Table 1. Random matching experiment results

Face Verification Experiment.

In this experiment, the pretrained MobileFaceNet model is used to perform 5-fold cross-validation on the IDSpot and IDSpot-paste datasets. Note that only the verification process is performed here; there is no training. We use TAR@FAR as the metric to explore the effect of occlusion. Table 2 shows the experimental results.

Table 2. Face verification experiment results. TAR@FAR is used to measure performance. TAR: true accept rate; FAR: false accept rate

The above two experiments explore the effect of partial occlusion. From Tables 1 and 2, we can see that occlusion has a large impact on face verification. The experimental results show that face verification with partial occlusion is a much harder task than conventional face verification.

4.3 Metrics

We use two evaluation metrics in this paper. We evaluate the models trained on MS1M on the LFW, AgeDB, and CFP-FP datasets using accuracy as the metric, and select the optimal model as the pretrained model according to accuracy. Accuracy is the ratio of correctly predicted samples to the total number of samples, reflecting the overall predictive ability of the model.

In addition, we use TAR@FAR as an evaluation metric: each false accept rate (FAR) corresponds to a true accept rate (TAR). This metric reflects the prediction ability of the model at a given operating point and can emphasize a particular aspect of performance, such as behavior at very low false accept rates.
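For clarity, here is a small sketch of how TAR at a fixed FAR can be computed from genuine (matched) and impostor (unmatched) similarity scores; this is our own illustration rather than the paper's evaluation code.

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-4):
    """TAR at the threshold whose false accept rate on impostor pairs equals `far`."""
    impostor_sorted = np.sort(impostor_scores)
    # Threshold such that a fraction `far` of impostor scores exceed it.
    threshold = impostor_sorted[int(np.ceil((1.0 - far) * len(impostor_sorted))) - 1]
    return (np.asarray(genuine_scores) > threshold).mean(), threshold

genuine = np.random.normal(0.6, 0.1, 3900)      # placeholder matched-pair scores
impostor = np.random.normal(0.1, 0.1, 100000)   # placeholder unmatched-pair scores
tar, thr = tar_at_far(genuine, impostor, far=1e-4)   # FAR = 0.01%
print(f"TAR@FAR=0.01%: {tar:.4f} (threshold {thr:.4f})")
```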

4.4 Training on MS1M [4]

Following ArcFace [8], we train the MobileFaceNet [23] network on the refined MS1M dataset [27]. We take ArcFace with a margin of 0.5 as the loss function. We set the batch size to 128; the learning rate is divided by 10 at 50K, 80K, and 100K iterations, and the total number of iterations is 140K. We set the momentum to 0.9 and the weight decay to 5e-4.
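As a reminder of the loss used for pretraining, the sketch below computes ArcFace-style additive angular margin logits with m = 0.5 in NumPy; the scale s = 64 is a commonly used value and not a number reported in this paper.

```python
import numpy as np

def arcface_logits(features, weights, labels, s=64.0, m=0.5):
    """Additive angular margin logits: s * cos(theta + m) for the target class,
    s * cos(theta) for the others. features: (N, d), weights: (C, d),
    labels: (N,) integer class indices."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)                  # (N, C) cosine similarities
    theta = np.arccos(cos)
    target_cos = np.cos(theta + m)                     # margin added to target angles
    logits = cos.copy()
    rows = np.arange(len(labels))
    logits[rows, labels] = target_cos[rows, labels]
    return s * logits  # feed into standard softmax cross-entropy
```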

Finally, we obtain the pretrained model, which achieves accuracies of 99.47%, 95.67%, and 93.45% on LFW [28], AgeDB [29], and CFP-FP [30], respectively.

4.5 Transfer Learning

In this part of the experiments, we adopt the triplet loss. We use hard sample mining to construct triplets: each time, the hardest sample in a batch is chosen as the negative. We set up a 5-fold cross-validation experiment. Each experiment runs for 50 epochs; the learning rate starts at 0.1 and drops to 0.01 and 0.001 at the beginning of the 11th and 26th epochs, respectively. We set the batch size to 32.
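A rough sketch of the batch-hard negative selection described above, under our own simplifications: embeddings are L2-normalized, distances are cosine distances, the anchor-positive assignment is assumed to be given, and the margin value is illustrative.

```python
import numpy as np

def hardest_negatives(anchor_emb, candidate_emb, anchor_ids, candidate_ids):
    """For each anchor, pick the candidate with a *different* identity that is
    closest to the anchor (i.e. the hardest negative in the batch)."""
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    dist = 1.0 - a @ c.T                               # cosine distance matrix (N, M)
    same_id = np.asarray(anchor_ids)[:, None] == np.asarray(candidate_ids)[None, :]
    dist[same_id] = np.inf                             # never pick a same-identity sample
    return dist.argmin(axis=1)                         # index of hardest negative per anchor

def triplet_loss(a, p, n, margin=0.35):
    """Standard triplet loss with a margin."""
    d_ap = np.linalg.norm(a - p, axis=1)
    d_an = np.linalg.norm(a - n, axis=1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

Selecting the closest different-identity sample makes each triplet as hard as possible within the batch, which is the strategy referred to above.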

For GWP, we evaluate the three ways of processing the weights described in Sect. 3.2: softmax, ReLU followed by softmax, and ReLU followed by rescaling (Table 3).

Table 3. Results of the GAP, GSC, and GWP experiments. TAR@FAR is used to measure performance. TAR: true accept rate; FAR: false accept rate. A, B, C denote Option-A, Option-B, and Option-C. S denotes the pseudo-siamese network. GAP-S(EMB) denotes the pseudo-siamese network sharing embedding layer parameters.

From the experimental results, we can see that the pseudo-siamese network effectively improves the representation of heterogeneous faces, which also confirms that the data distributions of ID photos and spot photos are very different. Sharing the embedding layer leads to a significant drop in performance, indicating that data with different distributions should undergo different linear transformations. Both GSC and GWP achieve performance improvements over GAP. At FAR = 0.01%, GWP performs best, indicating that GWP works better under low false accept rates.

5 Conclusion

In this paper, we explore face verification with partial occlusion and quantitatively analyze the impact of occlusion on face verification. We improve the face representation by adjusting the GAP part of the CNN. In addition, a pseudo-siamese network is used to explore and handle the heterogeneity of ID photos and spot photos.