1 Introduction

Image quality assessment algorithms [1,2,3] are widely used in medicine, information technology, the military and other fields. In image content understanding and video analysis, image quality directly affects the performance of downstream modules. For example, in face recognition, poor-quality images may prevent the recognition module from accurately verifying a user's identity.

Image quality assessment algorithms fall into three groups: full reference (FR), partial reference (PR) and no reference (NR) quality assessment. FR methods, such as PSNR [4] and SSIM [5], compare the distorted image with an ideal image and derive the quality score from the difference between them. However, ideal images are difficult to obtain in practice, and these methods are slow in actual evaluation.

Fig. 1. Examples of face images. (a)–(d) show that the image quality is good, but the face image quality is poor.

In contrast, NR methods do not use ideal images and estimate quality scores directly from the input images. The method proposed in this paper is an NR quality assessment algorithm. NR methods are mainly based on either distortion types or machine learning. Zhou [6] used the degree of blur caused by quantization errors in DCT coefficients as the measure of image quality. Cohen, Erez and Yitzhaky [7] combined noise and blur to estimate image quality scores. These distortion-based methods are effective to some extent, but their computation is slow.

With the rapid development of deep learning, more and more researchers have begun to use it for image quality assessment. Gao and Wang [8] used a pre-trained VGG model to extract image features and then an SVM to predict the score. Ma and Liu [9] used RankNet to learn a blind image quality assessment (BIQA) model, which predicts the quality score of an image without a reference image. A CNN model can learn the content information of images on its own and accomplish different tasks by learning from the prior knowledge provided during training.

Although many researchers have studied CNNs for predicting image quality scores, they all work on datasets from the general image quality assessment field. Face image quality assessment, however, differs from traditional image quality assessment: it is more concerned with face angle, blur, illumination and other factors that vary in real scenes. For example, in Fig. 1(a) the image quality is good, but the face image quality is poor because of the face angle. In Fig. 1(b), the face is occluded. In Fig. 1(c), the illumination on the face is uneven. In all the examples in Fig. 1, the image quality is good from the perspective of traditional image quality assessment but poor from the perspective of face image quality assessment. Moreover, as shown in Fig. 2, face images of extremely poor quality used in face recognition training degrade recognition performance and lead to recognition errors. Filtering out the worst-quality images can help face recognition systems improve their accuracy and effectiveness. To this end, we propose a two-stream CNN model to filter out the worst-quality images.

Fig. 2. In the process of face recognition training, some face images of extremely poor quality will affect the recognition effect. Filtering out the worst-quality images can help face recognition systems improve their accuracy and effectiveness. To this end, we propose a two-stream CNN model to filter out the worst-quality images.

This paper proposes a two-stream CNN model, named Deep Face Quality Assessment (DFQA), for the quality assessment of face images. Instead of the Euclidean loss used in other methods, we use the SVR loss [10] as the objective function, because SVR has been applied successfully to regression in previous image quality assessment tasks [11, 12]. Since image quality assessment is a small module in practice, its storage footprint and time complexity should be kept low. Each branch of the proposed DFQA model is based on SqueezeNet [14], a lightweight CNN model with advantages in both size and speed. We first annotate 3000 images with quality scores: six volunteers give comprehensive quality scores according to face angle, sharpness, illumination and other factors. The 3000 images are then used to fine-tune a SqueezeNet model pre-trained on the ImageNet dataset. The fine-tuned SqueezeNet model predicts quality scores for the MS-Celeb-1M [14] dataset, and these scores serve as labels. Finally, the MS-Celeb-1M dataset is used to train the DFQA network to obtain the final face image quality assessment model. Comparison experiments show that our model is smaller and faster, which makes it suitable for quality assessment tasks. Branch comparison experiments show that the double-branch network is more effective than a single branch. Face recognition experiments on the CASIA-WebFace and VGGFace2 datasets show that our model can help improve face recognition accuracy. Our main contributions can be summarized as follows:

(1) We build a new dataset manually annotated with objective quality scores. (2) We propose the DFQA model for the face image quality assessment task, which uses a two-stream parameter-sharing network structure to enhance prediction capability. (3) We conduct experiments to validate the advantages of the DFQA model, and face recognition experiments demonstrate its value in improving face recognition accuracy.

2 Related Work

Many classical methods have emerged for CNN-based image quality assessment. Bare Bahetiyaer and Ke [15] designed a CNN model based on residual networks, used FSIM to compute the labels, and predicted the quality score. Bosse S and Maniry [16] proposed two frameworks, one FR and one NR. Thirty-two image blocks are extracted from each image; one network branch predicts the image block weights, and another predicts the image quality score. Kim and Lee [17] obtained a weight for each image block and extracted features to predict the quality score. All these methods use CNNs to predict the image quality score.

At present, there are few CNN-based methods for face image quality assessment. Nasrollahi and Moeslund [18] used four simple features to assess face quality in video sequences. Truong and Dang [19] used contrast, brightness, focus and illumination to assess the face image quality score. Ranjan, Rajeev and Bansal [20] suggested using the probability score of face detection as the quality score of a face image. Chen and Yu [21] extracted HOG, Gabor, Gist, LBP and CNN features, obtained a weight for each feature with a sorting algorithm, and fused them to produce the quality score. These methods predict the quality score of a face image from selected image features or factors, but they do not start from the nature of the face image itself and jointly consider factors such as face angle, face sharpness and face illumination. The method proposed in this paper starts from manual annotation and takes these factors into consideration. We design a lightweight two-stream CNN model and use an SVR loss constraint to predict the quality score of face images.

Fig. 3. Manually graded face image quality score results. The final score range is [0, 1].

3 Method

To predict the quality score of face images, we select 3000 face images from the IJB-A dataset and have them annotated by six volunteers. We then propose a two-stream CNN model called Deep Face Quality Assessment (DFQA). In this section, we introduce the annotation process and the framework structure and parameter settings of DFQA.

3.1 Annotation

We select 3000 face images for manual annotation. Various factors intrinsic to face images are considered when scoring. Six volunteers, all researchers in the image field and aged between 25 and 35, score the quality of the face images. As shown in Fig. 3, we divide the range from 0 to 1 into 5 segments and label the quality score according to the labeling criteria in Fig. 3. We mainly consider face angle, face sharpness and face illumination. During labeling we also consider face visibility, facial expression and other factors, although this is not a requirement. The six volunteers score each image according to these requirements, and the average of their scores is taken as the final quality score of the image.
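The averaging step can be illustrated with a minimal sketch. The function name `final_quality_score` and the sample scores are hypothetical, and rounding the average to two decimals (to match the 0.01 score interval) is an assumption, since the text does not say how the average is quantized.

```python
from statistics import mean


def final_quality_score(volunteer_scores):
    """Average the per-volunteer scores of one image (hypothetical helper)."""
    if not volunteer_scores:
        raise ValueError("need at least one score")
    if any(s < 0.0 or s > 1.0 for s in volunteer_scores):
        raise ValueError("scores must lie in [0, 1]")
    # Assumption: the average is rounded to the 0.01 grid used for the labels.
    return round(mean(volunteer_scores), 2)


# Made-up scores from six volunteers for a single image.
scores = [0.62, 0.58, 0.65, 0.60, 0.55, 0.60]
print(final_quality_score(scores))
```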

Fig. 4. The framework of DFQA. The DFQA model adopts a two-stream parameter-sharing network structure, fuses the features at the tenth layer, and then applies average pooling to obtain the quality score. We first annotate 3000 face images to pre-train a CNN model, and then use that CNN model to predict quality score labels for the MS-Celeb-1M dataset, which are used to train the DFQA model.

As can be seen from Fig. 3, the annotated quality scores lie in [0, 1], with a score interval of 0.01. Currently, there are no public face image quality score datasets and no standard for annotating face image quality. We therefore assign a comprehensive score from a practical point of view, based on factors that may affect face recognition. In the labeling process, we mainly consider face angle, face sharpness, face illumination, facial expression and occlusion. These factors affect face quality in practice and, in turn, the accuracy of the face recognition system.

3.2 DFQA

The framework of DFQA is shown in Fig. 4. Each branch of DFQA uses SqueezeNet. Since the quality assessment module is a relatively small module in practical applications, it has strict requirements on size and speed. With a size of only 3M, SqueezeNet offers strong prediction ability and fast computation. Although a single-branch network also predicts well, its feature learning is not sufficient to represent the quality score, because many factors need to be considered in a face image. A double-branch network structure strengthens the feature representation and considers the various factors more comprehensively, so it characterizes the image quality better.

DFQA is a two-stream parameter-sharing network. Each branch consists of two convolutional layers and eight fire layers (the parameters of the conv10 and conv11 layers are not shared). The input is a 128 \(\times \) 128 face image. The conv10 convolutional layer uses a 1 \(\times \) 1 convolution kernel, and its final output is a 9 \(\times \) 9 matrix \(X_{1}\). The output of the conv11 layer in the other branch is \(X_{2}\). \(X_{1}\) and \(X_{2}\) are fused by sum fusion, as shown in formula 1.

$$\begin{aligned} X_{fusion}= & {} \lambda X_{1}+(1-\lambda )X_{2} \end{aligned}$$
(1)

Here, \(\lambda \) represents the weight of \(X_{1}\). The first nine layers share parameters, whereas the parameters of the conv10 layer are not shared, so the matrices \(X_{1}\) and \(X_{2}\) are not the same. \(X_{1}\) and \(X_{2}\) represent distributions of the quality scores. Because of the two-stream parameter-sharing structure, the parameters of the earlier layers are shared during forward and backward propagation. The parameters are not shared at the conv10 layer in order to increase the diversity of the score distribution, but the weights of \(X_{1}\) and \(X_{2}\) should not differ too much, so that the data remain uniform. For these reasons, the parameter \(\lambda \) is set to 0.5 to balance the two branches and further improve accuracy.
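As a minimal sketch of formula 1, the weighted sum fusion can be written in plain Python on nested lists. The function name `sum_fusion` and the constant-valued example maps are hypothetical; in the actual model \(X_{1}\) and \(X_{2}\) are the 9 \(\times \) 9 outputs of the two branches.

```python
def sum_fusion(x1, x2, lam=0.5):
    """Element-wise weighted sum: lam * x1 + (1 - lam) * x2 (formula 1)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x1) != len(x2) or any(len(a) != len(b) for a, b in zip(x1, x2)):
        raise ValueError("x1 and x2 must have the same shape")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(x1, x2)
    ]


# Illustrative 9 x 9 score maps filled with constants.
x1 = [[1.0] * 9 for _ in range(9)]
x2 = [[0.0] * 9 for _ in range(9)]
x_fusion = sum_fusion(x1, x2, lam=0.5)
print(x_fusion[0][0])
```

With \(\lambda = 0.5\) the fusion reduces to the plain average of the two maps, so neither branch dominates the fused score distribution.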

\(X_{fusion}\) is a 9 \(\times \) 9 matrix whose values are the quality scores of each block of the map. \(X_{fusion}\) then passes through a pooling layer, which computes the mean of \(X_{fusion}\), as shown in formula 2.

$$\begin{aligned} score= & {} \frac{1}{81}\sum _{i=1}^9\sum _{j=1}^9 X_{fusion}(i,j) \end{aligned}$$
(2)

Here, score is the predicted quality score. The overall framework is constrained by the objective function shown in formula 3.

$$\begin{aligned} W^{*}= & {} \mathop {\arg \min }\limits _{W}\frac{1}{N}\sum _{i=1}^N \left\| f(W,x_{i})-y_{i} \right\| +\beta \left\| W \right\| \end{aligned}$$
(3)

Here, W denotes the learnable parameters of DFQA. N is the dimension of the predicted quality score; in this case, N=1. y represents the label. \(\beta \) is a constant that controls the norm of the parameters in DFQA.
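Formulas 2 and 3 can be sketched as follows for a single image with N=1. The names `average_pool` and `objective` are hypothetical. The text does not state which norms are used, so the sketch assumes the absolute error for the scalar data term and the L2 norm for the parameter term; the value of `beta` is purely illustrative.

```python
import math


def average_pool(x_fusion):
    """Mean over all entries of the fused map (formula 2)."""
    values = [v for row in x_fusion for v in row]
    return sum(values) / len(values)


def objective(pred, label, weights, beta):
    """Data term plus weighted parameter norm for N = 1 (formula 3).

    Assumptions: absolute error for the scalar residual and the
    L2 norm for the parameters; the paper does not specify either.
    """
    data_term = abs(pred - label)
    reg_term = beta * math.sqrt(sum(w * w for w in weights))
    return data_term + reg_term


x_fusion = [[0.5] * 9 for _ in range(9)]
score = average_pool(x_fusion)
# Toy parameter vector and beta, for illustration only.
print(score, objective(score, 0.7, [3.0, 4.0], beta=0.1))
```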

By minimizing formula 3, the DFQA model learns to predict the quality score of a face image, analyzing its content information and integrating multiple factors to obtain the most appropriate score. To train the DFQA model, we use the Caffe platform with a batch size of 100. We adopt a step training policy with an initial learning rate of \(10^{-3}\). The learning rate is decreased once every 20 rounds of iteration, three times in total. The momentum is set to 0.96 and the weight decay to 0.0005. By considering the various practical factors and exploiting the ability of CNNs to learn on their own, the DFQA model fully analyzes the content information of a face image and predicts its quality score.
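The step policy described above can be sketched as a small function. The decay factor is not given in the text, so `gamma=0.1` is an assumption, as is treating a "round" as one unit of the `round_idx` counter; the function name is hypothetical.

```python
def step_learning_rate(round_idx, base_lr=1e-3, step_size=20, gamma=0.1, max_drops=3):
    """Learning rate after `round_idx` rounds under a step policy.

    Assumption: each drop multiplies the rate by `gamma` (0.1 here);
    the paper only states the initial rate, the step of 20 rounds
    and the total of three drops.
    """
    if round_idx < 0:
        raise ValueError("round_idx must be non-negative")
    drops = min(round_idx // step_size, max_drops)
    return base_lr * gamma ** drops


for r in (0, 19, 20, 40, 60, 100):
    print(r, step_learning_rate(r))
```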

We use the 3000 annotated face images described in Sect. 3.1 to fine-tune a SqueezeNet model pre-trained on the ImageNet dataset. This model is then used to predict labels for the MS-Celeb-1M dataset. With these labels, we train the DFQA model on the MS-Celeb-1M dataset to obtain the final parameters. Once training is complete, the DFQA model outputs a quality score for any input face image.

4 Experiment

To verify the effectiveness of the DFQA model, we design a series of experiments:

(1) Loss function test: We compare the effect of Euclidean loss and SVR loss on the performance of the DFQA model to demonstrate the effectiveness and necessity of choosing SVR loss.

(2) Classical method contrast test: We compare resnet10 [22], SqueezeNet [14], NRIQA [10], RankIQA [23] and DFQA on face image quality assessment to verify the accuracy and efficiency of DFQA.

(3) Face recognition test: The DFQA model is used to filter out poor-quality face images when training a face recognition model, in order to improve face recognition accuracy.

(4) Branch selection test: Networks with different numbers of branches are evaluated on quality assessment. Single-branch, three-branch and four-branch networks (referred to as Model-1, Model-3 and Model-4) are compared with DFQA to verify the effectiveness and necessity of the double-branch structure.

For training and testing the DFQA model, we use the MS-Celeb-1M dataset. The MS-Celeb-1M data we use consists of 2,000 person ids with a total of 165155 images. We use the CNN model fine-tuned on the 3,000 manually labeled face images to predict labels for these 165155 images. Each image is assigned to a quality score level in [0, 1] with a step of 0.01. At each quality score level, we select 400 face images for the training set and 100 images for the testing set. In the end, the training set contains 37409 face images and the testing set contains 9100 face images.
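The level-wise selection can be sketched as below. The function name `split_by_score_level` and the sample data are hypothetical. The reported totals (37409 and 9100) are below 101 levels times 400 and 100 images, which suggests that some levels had fewer images than requested; how such levels were handled is not stated, so taking whatever is available and filling the testing set first are both assumptions of this sketch.

```python
import random
from collections import defaultdict


def split_by_score_level(samples, n_train=400, n_test=100, seed=0):
    """Split (image_id, score) pairs into train/test lists per 0.01 level."""
    levels = defaultdict(list)
    for image_id, score in samples:
        # Map a score in [0, 1] to one of the levels 0, 1, ..., 100.
        levels[round(score * 100)].append(image_id)

    rng = random.Random(seed)
    train, test = [], []
    for level in sorted(levels):
        ids = levels[level]
        rng.shuffle(ids)
        # Assumption: fill the testing set first, then the training set.
        test.extend(ids[:n_test])
        train.extend(ids[n_test:n_test + n_train])
    return train, test


# Made-up data: one well-populated level and one sparse level.
samples = [(i, 0.50) for i in range(600)] + [(1000 + i, 0.90) for i in range(50)]
train_ids, test_ids = split_by_score_level(samples)
print(len(train_ids), len(test_ids))
```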

Table 1. Loss function results

4.1 Loss Function Test

In the loss function test, we compare Euclidean loss with SVR loss. Euclidean loss is a common objective function for regression problems. To verify the advantages and effectiveness of SVR loss, we replace the SVR loss of the DFQA model with Euclidean loss while keeping the network structure unchanged. The MS-Celeb-1M dataset is used for training with the same parameters. The experiment results are shown in Table 1.

In the testing stage, because the predicted quality score is a high-precision value, a prediction that is close enough to the groundtruth can be regarded as correct. We use three thresholds on the difference between the predicted score and the groundtruth: 0.2, 0.1 and 0.05, as shown in formula 4. The values in Table 1 are prediction accuracies. A prediction counts as correct when the difference between the predicted score and the groundtruth is less than the threshold, and as wrong otherwise. The prediction accuracy is the ratio of the number of correct predictions to the total number of test images N, as shown in formula 5.

$$\begin{aligned} \left| S_{pre} - S_{gnd} \right| < threshold \end{aligned}$$
(4)
$$\begin{aligned} Acc = \frac{Num(\left| S_{pre} - S_{gnd} \right| < threshold)}{N} \end{aligned}$$
(5)
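Formulas 4 and 5 amount to counting the predictions that fall within the threshold. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def threshold_accuracy(predicted, groundtruth, threshold):
    """Fraction of samples with |S_pre - S_gnd| < threshold (formulas 4 and 5)."""
    if len(predicted) != len(groundtruth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(predicted, groundtruth) if abs(p - g) < threshold)
    return hits / len(predicted)


# Illustrative scores only; these are not results from the paper.
s_pre = [0.50, 0.62, 0.90, 0.10]
s_gnd = [0.52, 0.70, 0.60, 0.13]
for t in (0.2, 0.1, 0.05):
    print(t, threshold_accuracy(s_pre, s_gnd, t))
```

A tighter threshold can only keep or lower the accuracy, which is why the 0.05 column is the hardest one to improve.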

As can be seen from Table 1, SVR loss has a clear advantage and regresses quality scores better. Even when the score difference threshold is very small, it predicts the scores with high accuracy.

4.2 Classical Method Contrast Test

In the classical method contrast test, we use resnet10, NRIQA and RankIQA to predict the quality scores of face images and compare them with DFQA. Since quality assessment is a small module, the models chosen for comparison should not be too large; otherwise they would not meet the size and speed requirements of a practical module. The model size, speed and accuracy of DFQA have already shown high performance in Table 1. We train resnet10, NRIQA and RankIQA on the MS-Celeb-1M dataset. The experiment results are shown in Table 2. All speed values were measured on a 64-bit Windows 10 system with an Intel i5 processor and the Caffe platform.

As can be seen from Table 2, DFQA has the highest accuracy under all thresholds. Although the NRIQA model is smaller, its accuracy is lower than that of the DFQA model. The resnet10 and RankIQA models not only have more parameters but also perform worse in face image quality assessment. Their prediction accuracy is good under a loose threshold, but it drops significantly as the threshold decreases, which means there is a noticeable gap between their predicted quality scores and the groundtruth. The DFQA model, by contrast, maintains higher prediction accuracy even under an extremely low threshold. This comparison shows that DFQA has strong prediction ability. The DFQA model is also faster and suitable for practical applications. The two-stream parameter-sharing structure analyzes the image content information further and improves the accuracy of the estimated quality score.

Table 2. Classical method contrast results

4.3 Face Recognition Test

In the face recognition test, we use the DFQA model to screen out poor-quality face images in the training stage. We select CASIA-WebFace [24] and VGGFace2 [25] as the training datasets and use LFW [26] as the testing dataset.

We select 6000 person ids from CASIA-WebFace and VGGFace2. Each person id includes about 60 images. In total, 338307 face images are selected as the training set from the CASIA-WebFace dataset, and 360000 face images from the VGGFace2 dataset. For the face recognition model, we choose FaceNet [27]. We first train the FaceNet model on CASIA-WebFace and VGGFace2 and test it on LFW to obtain a baseline. Then, we use the DFQA model to predict the quality scores of the face images in the CASIA-WebFace and VGGFace2 datasets, filter out the face images whose quality scores are below 0.1 to obtain a new training set, and retrain the FaceNet model. The experiment results are shown in Table 3.
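The filtering step can be sketched as a simple selection over predicted scores. The function name `filter_low_quality` and the file names are hypothetical, and the scores are assumed to come from a trained DFQA model; keeping images whose score equals the cutoff exactly is an assumption, since the text only says that images below 0.1 are removed.

```python
def filter_low_quality(scored_images, min_score=0.1):
    """Keep images whose predicted quality score is at least `min_score`."""
    return [path for path, score in scored_images if score >= min_score]


# Made-up (path, predicted score) pairs for illustration.
scored = [
    ("id_0001/a.jpg", 0.82),
    ("id_0001/b.jpg", 0.04),
    ("id_0002/a.jpg", 0.10),
    ("id_0002/b.jpg", 0.09),
]
print(filter_low_quality(scored))
```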

Table 3. Face recognition results
Table 4. Branch selection results

As can be seen from Table 3, DFQA helps improve face recognition accuracy. The CASIA-WebFace and VGGFace2 datasets contain many face images of particularly poor quality: some are blurred, and in others the face cannot be seen because the image is too dark. Such poor-quality images affect the direction in which the parameters are adjusted during training. Because very blurred and dark face images appear under many different person ids, it is difficult for the model to distinguish poor-quality face images of different people during training. Filtering out some poor-quality images during training therefore makes the face recognition model more robust and more resistant to interference.

4.4 Branch Selection Test

In the branch selection test, we evaluate single-branch, three-branch and four-branch networks (referred to as Model-1, Model-3 and Model-4) on face image quality assessment to validate the necessity of choosing two branches. The experiment results are shown in Table 4.

As can be seen from Table 4, the prediction accuracy changes little with three or four branches, while the model becomes larger and slower. As can be seen from Table 2, the accuracy of many models drops sharply as the threshold decreases. This shows that it is very difficult to improve prediction accuracy when the threshold is very low. For example, with a threshold of 0.05, the predicted quality score may differ from the groundtruth by less than 0.05. It can also be seen from Fig. 3 that, in manual labeling, 0.2 is a threshold with a relatively high degree of discrimination, whereas 0.05 is a threshold with a very low degree of discrimination, at which it is hard to tell the quality scores of face images apart. Therefore, under the strict threshold of 0.05, the 3-point improvement of DFQA over Model-1 is extremely hard to achieve, which indicates that the DFQA model has strong prediction and characterization ability. Although Model-1 also performs well, further gains in prediction accuracy are hard to obtain once model size and speed are already good, so we choose the DFQA model, which has stronger prediction ability, to predict the quality score of face images.

5 Conclusion

In this paper, we propose a two-stream network structure called DFQA to predict the quality score of face images. Considering a variety of practical factors, we assign comprehensive scores to 3000 face images by manual annotation. Each branch of DFQA uses the lightweight SqueezeNet model, with parameter sharing and co-learning in the first nine layers to further analyze the quality information of face images. Sum fusion is used to fuse the quality information, and SVR loss is used to constrain the overall framework. The result is a quality assessment model with few parameters and high accuracy. The experiment results support the validity and necessity of the overall architecture, and the proposed model can help improve the accuracy of face recognition.

In future work, we will apply this idea to more practical scenarios, such as pedestrian detection, video analysis and image search, to help them remove unnecessary interference more quickly and accurately and thereby improve accuracy.