1 Introduction

Mobile devices are used daily to create and transmit private information. Traditionally, PINs and passwords have been used to secure mobile devices, but they are not very secure since users tend to forget and reuse them. Recently, biometric-based authentication has been shown to be a convenient and secure option [2]. Among the different biometric technologies, face recognition has become a promising security option for mobile devices [10]. Faces are easy to capture and to store on mobile devices; moreover, the levels of accuracy achieved by face recognition systems in the past few years make them more secure. One of the main problems studied in mobile authentication, known as Active Authentication (AA), is to guarantee that the user who was originally authenticated is the one who remains in control of the device.

Different active authentication approaches have been proposed for verifying the user identity based on facial information [14]. In [5], nine state-of-the-art face recognition methods were evaluated for active authentication, and challenging evaluation protocols representing real scenarios were introduced. Continuous authentication based on facial attributes was proposed in [15] for fast processing: a set of binary attribute classifiers was trained, and authentication was performed by comparing the estimated attributes with the enrolled attributes of the original user. Recently, a sparse representation-based method for multiple-user authentication was proposed in [12], where a parameter selection scheme was introduced for extreme value distributions to make them feasible for an automated mechanism.

There are only a few methods based on deep learning for mobile authentication [4, 14]. Although these methods achieve very high recognition rates, they require high computational resources. Moreover, existing methods usually use the complete set of frames available in a captured face sequence, which can also be time consuming. On the other hand, it is well known that the degradation of face quality results in poor performance of recognition algorithms. Due to the uncontrolled environment of a phone camera, the videos obtained for training the system and those that are recognized later can be very different. Hence, training a method for all possible qualities that may appear requires too much information [16]. Under these circumstances, algorithms capable of adapting their behavior depending on the quality of the biometric sample are needed [17, 18].

In this paper, a method is introduced that decides which biometric information is good enough for obtaining a reliable face authentication result. In order to choose which part of a video sequence contains sufficiently good biometric information, a frame selection method based on the quality of face features is introduced. The proposed method takes into account blur, pose and expression variations, which are among the factors that most affect mobile authentication.

The rest of this paper is organized as follows. Section 2 reviews some of the existing works on determining face image quality. Section 3 introduces the proposed method. In Sect. 4 the experimental evaluation is presented. Finally, conclusions and future work are given.

2 Related Work

Sample quality has an important impact on the accuracy of biometric recognition systems [9]. Particularly for faces, a number of approaches have been proposed to determine the quality of face images and to select a reduced number of frames from a given sequence [1, 3, 13, 16].

Most of the recent proposals that achieve very good results are based on deep learning methods. Yang et al. [16] proposed using deep neural networks to classify images based on their quality; then, separate face detectors and recognizers are used depending on the image quality. Two types of image quality problems are considered: JPEG compression and low resolution. One drawback of this approach is that these types of quality problems are not always proportional to the quality of the face features that can be extracted from the images. Moreover, for each type of quality problem, a large number of samples is needed to train a specialized face detector and recognizer. Another approach that uses a deep convolutional neural network (CNN) for predicting face image quality is [1]. First, features are extracted with a CNN, and from these features a prediction model of face quality is learned using support vector regression. The main disadvantage of using deep learning at this early stage is the high computational cost of analyzing all the frames in a sequence, which may not be feasible for mobile applications.

Zohra and Gavrilova [17, 18] proposed a system where illumination distortion is normalized using quality-based normalization approaches; however, only quality problems related to illumination are handled. The method proposed in [3] considers both image and face characteristics, but it was specifically designed for FPGA architectures.

It is evident from the above analysis that methods specifically designed for mobile authentication are still needed. They should be able to analyze the most common quality problems in these scenarios in an efficient way.

3 Proposed System

The proposed method for face recognition on mobile devices is composed of three main steps, which are explained in detail in this section. The first step is to determine which frames of the sequence contain the most valuable face information for recognition. The selected frames are then represented by a single feature vector obtained with a CNN model for face feature extraction; in this paper, three different models are evaluated. Finally, classification is performed with a SoftMax function.

3.1 Face Image Quality Assessment

For active authentication on mobile devices, a video sequence can be captured and the face classification can then be performed with the frames containing the most relevant information. For this aim, a quality value is estimated for each frame of the video. This quality value is calculated from four measures that describe the most common problems present in mobile face authentication. The proposed quality measures are based on facial landmarks, obtained with a fast and accurate method [7] implemented in the Dlib library [8].
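As an illustration of this step, the following Python sketch shows how the 68 facial landmarks could be obtained with the Dlib face detector and shape predictor [7, 8]; the model file name and the helper function are ours, not part of the original implementation.

```python
import dlib

# Standard Dlib components: HOG-based face detector and the 68-point
# landmark predictor (regression-tree method of [7]).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def get_landmarks(frame_gray):
    """Return the 68 (x, y) landmark points of the first detected face,
    or None if no face is found in the frame."""
    faces = detector(frame_gray, 1)
    if not faces:
        return None
    shape = predictor(frame_gray, faces[0])
    return [(p.x, p.y) for p in shape.parts()]
```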

Pose Evaluation. The problem with face pose is that images of different subjects can be grouped together more easily than images of the same subject captured in different poses. Enrolled face images are usually in frontal position; hence, it is desirable that the face pose of the images received for authentication ranges between \(\pm 45^\circ \).

The pose is estimated by calculating the displacement of a set of landmark points with respect to the corresponding points on a face with neutral pose in a 3D model. The selected points are the corners of the eyes, the corners of the lips, the tip of the nose and the tip of the chin. For a better understanding, a graphic example is illustrated in Fig. 1. The average displacement is computed and normalized to [0, 1], where 1 stands for no displacement and 0 means a large displacement from the frontal position.

Fig. 1. A graphic example of the points used for estimating the face pose.
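A minimal sketch of this measure is given below. It follows the description above but works entirely in 2D against a frontal reference shape rather than the 3D neutral model used in the paper; the landmark indices, the inter-ocular normalization and the constant d_max are our assumptions.

```python
import numpy as np

# Indices (68-point Dlib layout) of the points described above:
# eye corners, mouth corners, nose tip and chin tip.
POSE_IDX = [36, 39, 42, 45, 48, 54, 30, 8]

def pose_score(landmarks, neutral_ref, d_max=0.5):
    """Rough pose quality in [0, 1]: 1 = frontal, 0 = far from frontal.

    landmarks   : list of 68 (x, y) points detected in the frame.
    neutral_ref : the same points for a frontal, neutral-pose reference
                  (a 3D model in the paper; a 2D stand-in here).
    d_max       : assumed maximum average displacement (in inter-ocular
                  distance units) that still counts as usable pose.
    """
    def normalize(points, left_eye, right_eye):
        # Remove translation and scale so only pose-related
        # displacement remains.
        p = np.asarray(points, dtype=float)
        iod = np.linalg.norm(np.asarray(right_eye, float) -
                             np.asarray(left_eye, float))
        return (p - p.mean(axis=0)) / iod

    pts = normalize([landmarks[i] for i in POSE_IDX],
                    landmarks[36], landmarks[45])
    ref = normalize([neutral_ref[i] for i in POSE_IDX],
                    neutral_ref[36], neutral_ref[45])

    mean_disp = np.linalg.norm(pts - ref, axis=1).mean()
    return float(np.clip(1.0 - mean_disp / d_max, 0.0, 1.0))
```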

Facial Expression Evaluation. Different facial expressions modify the face shape and appearance. The eyes and the mouth are two of the face regions that change most with different expressions. Completely closed eyes (one or both) can be seen as a particular case of occlusion. On the other hand, if the mouth is not naturally closed or if it is wide open, the facial appearance changes drastically.

To define a neutral expression based on the eyes and the mouth, the landmark points at the corners of these regions are used. With these points, a triangle with sides labeled A, B and C is formed, where the aperture angle \(\alpha \) is the one opposite to side A. Then \(\alpha \) is calculated by the law of cosines as follows:

$$\begin{aligned} \alpha = \arccos \left( \frac{B^2 + C^2 - A^2}{2 B C}\right) \end{aligned}$$
(1)
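The aperture angle of Eq. (1) can be computed directly from three landmark points; the following helper is a minimal sketch under that description, not the authors' code.

```python
import math

def aperture_angle(apex, p1, p2):
    """Angle (radians) at `apex` of the triangle (apex, p1, p2), i.e. the
    angle opposite to side A = |p1 - p2|, via the law of cosines (Eq. 1)."""
    A = math.dist(p1, p2)   # side opposite the aperture angle
    B = math.dist(apex, p1)
    C = math.dist(apex, p2)
    return math.acos((B**2 + C**2 - A**2) / (2 * B * C))
```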

The Eyes. A natural eye expression is defined by the angle formed at the commissure of the eye corners: the greater the angle, the better the expression. Using a set of 500 face images, the average values of the maximum and minimum aperture are determined in order to define the classification intervals. The maximum and minimum angles from this set are used to linearly normalize a given angle to [0, 1]. For a better understanding, a visual example is shown in Fig. 2. Both corners of the left and right eyes are taken into account to determine the state of the eyes.

Fig. 2. A graphic example of the eyes expression state. Left: eyes with neutral expression; Right: closed eyes.

The Mouth. The mouth is also analyzed using the angles at its corners; in this case, the smaller the angle, the better the expression. The landmark points detected at the corners of the lips are used, as illustrated in Fig. 3. In addition, it is also taken into account whether both angles have similar values to determine a neutral expression. The mouth expression can change in many different ways, which is why this heuristic is based on determining whether the mouth remains closed to ensure a neutral expression.

Fig. 3. A graphic example of the mouth expression state. Left: mouth with neutral expression; Right: opened mouth.

Face Blurriness. To detect how blurred the face image is, the variance of the Laplacian is calculated [11]. By using the landmark points, it is ensured that only the face region is considered and that background information is not taken into account. The Laplacian kernel, Equation (2), is commonly used for detecting edges; hence, its variance gives an idea of the edge response in an image, allowing us to determine how blurred the face image is.

$$\begin{aligned} Lap(m,n)= \begin{bmatrix} 0 &{} -1 &{} 0 \\ -1 &{} 4 &{} -1 \\ 0 &{} -1 &{} 0 \end{bmatrix} \end{aligned}$$
(2)
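A possible implementation of this measure with OpenCV is sketched below; cropping to the bounding box of the landmarks and the normalization threshold are our assumptions, not values from the paper.

```python
import cv2

def blur_score(face_bgr, landmarks, threshold=100.0):
    """Variance of the Laplacian restricted to the face region.

    The face region is approximated by the bounding box of the landmarks,
    so background edges do not influence the measure. `threshold` is an
    assumed value above which the face is considered sharp; it maps the
    variance into a [0, 1] score.
    """
    xs, ys = zip(*landmarks)
    face = face_bgr[min(ys):max(ys), min(xs):max(xs)]
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    var = cv2.Laplacian(gray, cv2.CV_64F).var()
    return float(min(var / threshold, 1.0))
```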

Final Quality Assessment. Let \(p \in [0, 1]\), \(e \in [0, 1]\), \(m \in [0, 1]\) and \(b \in [0, 1]\) be the values of the four quality measures defined above: pose, eyes, mouth and blur, respectively. The global quality value is estimated using the following linear combination:

$$\begin{aligned} q = p \, k_p + e \, k_e + m \, k_m + b \, k_b , \end{aligned}$$
(3)

where \(q \in [0, 1]\): 1 represents the highest quality and 0 the lowest, and \(k_i, i \in \lbrace p, e, m, b \rbrace \) are the weights for each feature with:

$$\begin{aligned} \sum _{i \in \lbrace p, e, m, b \rbrace } k_i = 1. \end{aligned}$$
(4)

The weights can vary depending on the scenario and on which factors are most influential. A visual example of the general steps of the face image quality assessment method is presented in Fig. 4.

Fig. 4. General steps for selecting the best N frames in a video sequence.
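Putting the four measures together, a minimal sketch of the final scoring and frame selection of Fig. 4 could look as follows; the default weights reflect the setting described in Sect. 4 (blur weighted 0.1, the remaining measures 0.3 each), and the function names are ours.

```python
def frame_quality(p, e, m, b, weights=(0.3, 0.3, 0.3, 0.1)):
    """Weighted quality score of Eq. (3); the weights must sum to 1."""
    k_p, k_e, k_m, k_b = weights
    return k_p * p + k_e * e + k_m * m + k_b * b

def select_best_frames(frames, measures, n=10):
    """Return the n frames with the highest quality score.
    `measures` is a list of (p, e, m, b) tuples, one per frame."""
    scored = sorted(zip(frames, measures),
                    key=lambda fm: frame_quality(*fm[1]), reverse=True)
    return [frame for frame, _ in scored[:n]]
```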

3.2 Feature Extraction

Once the best frames are selected, a face descriptor is needed. In this paper we explore the use of three different CNN models selected from the literature. The first is the widely used face deep model from the Dlib library [8]. This model is based on a version of the ResNet50 model with 29 convolutional layers [6]. The output of this network is a 128-dimensional vector that represents the facial features and appearance of the subject in every frame. Since each vector represents a single face image, the vectors can be combined by applying an average pooling operation across the 128 dimensions. The second model is the original ResNet50 trained on the MS-Celeb-1M dataset [6] and then fine-tuned on the VGGFace2 dataset. The last one is MobileFaceNet [4], a model specifically designed for high-accuracy real-time face verification on mobile devices. This network was trained with the refined MS-Celeb-1M dataset using the ArcFace loss.
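The average pooling of the per-frame descriptors can be summarized in a few lines; this sketch only assumes that an embedding (e.g. a 128-D Dlib vector or a ResNet50/MobileFaceNet output) has already been computed for each selected frame.

```python
import numpy as np

def sequence_descriptor(frame_embeddings):
    """Combine per-frame face embeddings into a single descriptor by
    average pooling across the selected frames."""
    E = np.asarray(frame_embeddings, dtype=float)  # shape (n_frames, dim)
    return E.mean(axis=0)
```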

3.3 Classification

The classification stage is carried out using the SoftMax function. The SoftMax loss is based on the cross-entropy loss:

$$\begin{aligned} L_i = -\log \left( \frac{e^{f_{y_i}}}{\sum _j{e^{f_j}}}\right) \end{aligned}$$
(5)

where \(f_j\) is the j-th element of the vector of class scores f, and \(L_i\) is the loss for the i-th training example; the full loss over the dataset is the average of \(L_i\) over all training examples.

The SoftMax function yields a probability for each class and is commonly used as the final layer of a neural network. Suppose we have a classification problem with 10 different classes, so the dimension of the output layer is 10. Ideally, a single output node would receive a score of 1.0 and the remaining output nodes a probability of zero. The architecture that exactly satisfies this requirement is a Max-layer output, which assigns a probability of 1.0 to the maximum output of the previous layer and zero to the rest. However, such an output layer is not differentiable and is therefore difficult to train. Alternatively, if the SoftMax function is used, it behaves almost like the Max-layer while remaining differentiable, so it can be trained by gradient descent. The exponential function increases the relative weight of the maximum input value compared to the other values. Another characteristic of the SoftMax layer is that the sum of all its outputs is always equal to 1.0.
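For reference, a minimal NumPy sketch of the SoftMax function and the per-example loss of Eq. (5) is given below; it is illustrative only and independent of the actual training code.

```python
import numpy as np

def softmax(scores):
    """Numerically stable SoftMax over a vector of class scores."""
    z = scores - np.max(scores)   # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def cross_entropy_loss(scores, true_class):
    """Loss L_i of Eq. (5) for a single training example."""
    return -np.log(softmax(scores)[true_class])
```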

4 Experimental Results

The UMD-AA dataset [5], a very challenging testbed for experiments on active authentication for mobile devices, has been used for the experimental evaluation. The videos are recorded under different illumination conditions within a laboratory room: the first subset of videos was captured with artificial lighting (Session 1), the second subset was captured without any illumination (Session 2), and the last subset was captured under natural sunlight (Session 3). The database is composed of videos from 150 subjects, with 5 videos available per subject in each session. One of the five videos, containing different changes in face position and rotation, is used for enrollment; the remaining four videos are used for testing. The test videos were captured from mobile devices while the user was performing a specific activity, such as looking at a popup window, scrolling text, taking a picture or working on a document.

We use Protocol 1, which is the most difficult one. Under this protocol, the training data consists of the enrollment videos from one session, while the test videos from the other two sessions are used for testing. Hence, there are six possible scenarios for this protocol: enrollment from Session 1 with testing on Sessions 2 and 3; enrollment from Session 2 with testing on Sessions 1 and 3; and enrollment from Session 3 with testing on Sessions 1 and 2. In our experiments, the landmark points are obtained directly from the images/frames without any preprocessing. Considering that blur is the least influential factor in this database, we use \(k_b=0.1\) and a weight of 0.3 for each of the other three parameters.

The first experiment focuses on the selection of the best frames. We evaluate selecting the best three and the best ten frames, and compare the results with using all frames of the sequence and with ten randomly selected frames. The Rank-1 recognition rates for the three models on each evaluated scenario are shown in Table 1. As can be seen from the table, selecting a given number of frames yields much better results in all cases than using all frames or randomly selected frames. On the other hand, selecting ten frames is in general better than selecting only three. However, except for the Dlib model, the results are very close, so using only three frames can be an option for devices with limited resources. It should be noticed that using all the available frames in a sequence is not feasible in terms of computing time for mobile authentication. The average processing time per image in the feature extraction stage is around 352.63 ms for the Dlib network, 85.56 ms for ResNet50 and 20.65 ms for MobileFaceNet, while the videos of the database have an average of 180 frames. It is not possible to process this number of frames on mobile devices in real time using a traditional deep-learning-based approach (Dlib and ResNet50). Even for the MobileFaceNet network, which is designed for mobile environments, processing all the frames could be very time consuming.
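As a rough back-of-the-envelope estimate based on these figures (assuming a 180-frame video and no cost other than feature extraction):

$$\begin{aligned} 180 \times 352.63\,\text {ms} \approx 63.5\,\text {s}, \quad 180 \times 85.56\,\text {ms} \approx 15.4\,\text {s}, \quad 180 \times 20.65\,\text {ms} \approx 3.7\,\text {s}, \end{aligned}$$

for Dlib, ResNet50 and MobileFaceNet respectively, whereas extracting features from only the ten selected frames with MobileFaceNet takes roughly \(10 \times 20.65\,\text {ms} \approx 0.2\) s.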

Table 1. Rank-1 recognition rates (%) for different frame selection strategies.

The ROC curves for the three models when selecting 10 frames, using each session for enrollment, are shown in Fig. 5. By analyzing the balance between accuracy and efficiency of the three evaluated models, we consider MobileFaceNet to be the most adequate network for mobile authentication.

Fig. 5. ROC curves for the proposed method (selecting 10 frames) when each session is used for enrollment.

In Table 2 we compare the results obtained by the MobileFaceNet model in Protocol 1, with the best performing methods in [5]: Fisherfaces (FF), Sparse Representation-based Classification (SRC), and Mean-Sequence SRC (MSSRC). One can see clearly that our proposal outperforms the other methods by a large margin.

Table 2. Rank-1 Recognition rates (%) of state-of-the-art methods on Protocol 1 of UMD-AA dataset.

On the other hand, by analyzing the Area Under the Curve (AUC) for this model in Fig. 5, it can be seen that it is much higher than those reported in [15]. In Table 3, the EER values for the proposed strategy using the MobileFaceNet model are compared with those of the attribute-based methods presented in [15], corroborating the superiority of the proposal.

Table 3. EER for different methods in UMD-AA dataset.

5 Conclusion

In this paper an approach for face active authentication on mobile devices is presented. The proposal makes use of facial landmarks to efficiently select the best frames of a face video sequence. It is shown that, for three different CNN models, selecting the best frames is not only more efficient but also more accurate than using all frames of a video sequence captured during an authentication session.