
1 Introduction

Face inpainting aims to restore missing facial regions and is attracting increasing attention in the field of facial recognition, especially with the emergence of convolutional neural networks (CNNs) [1] and generative adversarial networks (GANs) [2]. Previous work falls into two branches: conventional methods and deep learning based methods. Conventional methods typically measure the similarity between facial patches and replace a missing patch with its nearest-neighbor patch [3, 4]. Deep learning based methods are usually end-to-end models that complete the face image but, in most cases, cannot retain the facial attributes or actions of the ground-truth face image [5, 6]. Although deep learning based methods have achieved great success, the generated face images often contain distortions and fake traces. Some works also use low-level facial structure information, such as landmarks and face parsing [13, 15], which is easily captured from the contents of face images, to generate the whole face. However, these methods cannot guarantee that the facial actions match the ground-truth. In order to better learn the relation between the occluded and non-occluded facial regions, we exploit the co-existence of AUs to help complete the face image.

Face inpainting is beneficial to the study of facial emotion analysis, which is easily affected by challenging conditions such as low resolution and occlusion. In the field of facial expression analysis, facial action units (AUs) refer to a unique set of basic facial muscle actions at certain facial locations defined by the Facial Action Coding System (FACS) [7], one of the most comprehensive and objective systems for describing facial expressions. Since facial structure and the patterns of facial expressions are relatively fixed, taking the relations among different AUs into consideration should be beneficial for face inpainting under occlusion. However, face inpainting studies that explore the relations of different facial regions under different facial expressions are rare in the literature.

In this paper, we propose a novel face inpainting framework that exploits the correlations between occluded and non-occluded facial regions. AU knowledge is obtained from statistics over the dataset. Based on this AU knowledge, we propose an AU discriminator that helps the classifier match the distribution of AUs in the dataset. The AU information detected on the coarse face image is injected as a condition into the refinement network. This weakly-supervised design helps the generator learn face images under different AU distributions: the more accurate the AU results, the better the refined face image retains the details of the ground-truth. The contributions of this paper are three-fold. First, a novel coarse-to-fine face inpainting network is proposed that exploits the dynamic structure information of facial action units. Second, the weakly-supervised design with the help of an AU discriminator is effective for face inpainting with different facial expressions. Third, AU knowledge is combined into the generative model to recover facial expression information during inpainting, which, to the best of our knowledge, has not been done before.

2 Related Work

Our proposed framework is closely related to generative adversarial networks, existing face inpainting methods, and facial action unit detection methods.

2.1 Generative Adversarial Networks

As one of the most popular and significant image generation techniques, GAN [2] achieves promising results by utilizing a min-max adversarial game to train a generator and a discriminator alternately. Thanks to adversarial training, the generator produces realistic face images. Recently, various extended GANs have achieved great success in face inpainting [5, 8], super-resolution [9, 10], style transfer [11] and face rotation [12]. Motivated by these successful solutions, we develop our face inpainting model based on a GAN and an AU discriminator.

2.2 Face Inpainting

Human face inpainting is more challenging than general image inpainting because facial structure is relatively fixed yet contains large appearance variations. In addition, preserving face identity information is very important during face inpainting. Li et al. proposed to complete faces with face parsing results [13]. Zhang et al. developed models to complete face images occluded by wavy lines [14]. Both face parsing and facial landmarks were utilized to supervise model learning in [15]. However, such low-level geometric information is insufficient for recovering facial details involving semantic relations among different facial regions. Taking advantage of the co-existence of different AUs under different facial expressions to restore the actions of the filled region remains an open challenge, which motivates us to complete face images with high-level AU information.

2.3 Facial Action Unit Detection

Automatic facial AU detection plays an important role in describing facial actions. To recognize facial action units in complex environments, many works have explored various features and classifiers. With the development of deep learning, recent works exploit deep representations to recognize AUs, such as [16]. Considering the correlation between different AUs, [17] attends to regions centered at facial landmarks, which is efficient for model learning, and [18] proposes a region layer to help learn deep representations. AU knowledge enables model learning in a weakly-supervised way in [19]. These methods achieve good performance for AU detection, which motivates us to exploit the correlation between occluded and non-occluded facial regions.

3 Our Method

In this section, an overview of the proposed model, which exploits prior knowledge of facial action units to complete face images, is first introduced in Sect. 3.1. Then, the AU knowledge is presented in Sect. 3.2. Lastly, the pre-training of the AU classifier is explained in Sect. 3.3.

3.1 Proposed Method

In this section, the overall structure of our proposed model is introduced in three parts: the first network is the face inpainting generator; the second is the discriminator network, which is mainly used to distinguish the real original image from the fake repaired image; the third is an AU recognition network. It is worth noting that the discriminator network and the AU recognition network are not used at test time. The overview of our proposed face inpainting framework is shown in Fig. 1.

Fig. 1.

Overview of the proposed face inpainting framework. E and D denote the encoder and decoder of the coarse network, while Encoder and Decoder form the refinement network. A pre-trained classifier is utilized to detect facial action units. The final 0/1 output discriminates the detected AU probabilities from the AU knowledge.

In the first part, a coarse-to-fine face inpainting generator consisting of a coarse network and a refinement network is proposed. Given a masked facial image, the coarse network, an auto-encoder, first generates a coarse face image. In the refinement network, by taking the facial action unit information detected on the coarse face image as a condition, a refined facial image with the corresponding AU details is generated. The input is a 128 × 128 × 3 face image with a mask and the output is a restored 128 × 128 × 3 face image.

Coarse Network:

The structure of the coarse network is shown in Fig. 2(a): four 5 × 5 convolutional layers with stride 2 are used in the down-sample network, and four 5 × 5 deconvolutional layers are used in the symmetric up-sample network. The tanh activation function is utilized in the output layer of the last deconvolutional layer, and the ReLU activation function is utilized in the other layers. To obtain a better coarse result, a pixel-wise loss is adopted to supervise model learning, defined as:

$$ L_{pixel\_c} = \left\| {x_{r} - x_{coar} } \right\|_{2}^{2} , $$
(1)

where \( x_{r} \) is the ground-truth face image without mask, and \( x_{coar} \) is the output coarse face image.
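For concreteness, the following is a minimal sketch of how the coarse network described above might be implemented, assuming PyTorch; the channel widths are illustrative assumptions, since only the kernel sizes, strides, and activations are specified. The pixel-wise loss follows Eq. (1) in its mean-squared form.

```python
# Hypothetical PyTorch sketch of the coarse network: four 5x5 stride-2
# convolutions, four symmetric 5x5 deconvolutions, ReLU everywhere except a
# tanh output, plus the pixel-wise loss of Eq. (1). Channel widths are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Down-sample path: 128 -> 64 -> 32 -> 16 -> 8 spatially.
        self.enc = nn.ModuleList([
            nn.Conv2d(3, 64, 5, stride=2, padding=2),
            nn.Conv2d(64, 128, 5, stride=2, padding=2),
            nn.Conv2d(128, 256, 5, stride=2, padding=2),
            nn.Conv2d(256, 512, 5, stride=2, padding=2),
        ])
        # Symmetric up-sample path: 8 -> 16 -> 32 -> 64 -> 128.
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(512, 256, 5, stride=2, padding=2, output_padding=1),
            nn.ConvTranspose2d(256, 128, 5, stride=2, padding=2, output_padding=1),
            nn.ConvTranspose2d(128, 64, 5, stride=2, padding=2, output_padding=1),
            nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2, output_padding=1),
        ])

    def forward(self, x_masked):
        h = x_masked
        for conv in self.enc:
            h = F.relu(conv(h))
        for i, deconv in enumerate(self.dec):
            h = deconv(h)
            h = torch.tanh(h) if i == len(self.dec) - 1 else F.relu(h)
        return h  # coarse face image x_coar in [-1, 1]

def pixel_loss(x_r, x_coar):
    # Eq. (1), averaged over pixels rather than summed.
    return F.mse_loss(x_coar, x_r)
```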

Fig. 2.

The structure of the coarse network and the refinement network in the generator.

Refinement Network:

The structure of the refinement network is shown in Fig. 2(b): four 5 × 5 convolutional layers with stride 2 are used in the down-sample network, followed by a fully connected layer. The condition information here is the combination of the latent feature output by the encoder and a one-hot vector of the AU classification results. The 1024-dimensional concatenated vector is regarded as the bottleneck of the refinement network and is fed to the first fully connected layer of the up-sample network, which outputs an 8 × 8 × 512-dimensional vector. The output is then reshaped into a tensor of shape (8, 8, 512). Finally, four 5 × 5 deconvolutional layers are used in the decoder network. As in the coarse network, the activation of the last layer is tanh, and a similar pixel-wise loss is defined as:

$$ L_{pixel\_f} = \left\| {x_{r} - x_{fine} } \right\|_{2}^{2} , $$
(2)

where \( x_{fine} \) is the fine face image output by the refinement network.
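The bottleneck of the refinement network can be sketched as follows (again assuming PyTorch); the split of the 1024-dimensional concatenated vector between the encoder latent and the AU condition vector is not specified here, so only the concatenation and projection are shown.

```python
# Hypothetical sketch of the refinement-network bottleneck: the encoder latent
# and the AU condition vector are concatenated into a 1024-d vector, projected
# by a fully connected layer to 8*8*512 and reshaped for the deconvolutions.
import torch
import torch.nn as nn

class RefinementBottleneck(nn.Module):
    def __init__(self, concat_dim=1024):
        super().__init__()
        self.fc = nn.Linear(concat_dim, 8 * 8 * 512)

    def forward(self, latent, au_condition):
        # latent: flattened encoder feature; au_condition: detected AU vector.
        z = torch.cat([latent, au_condition], dim=1)  # assumed 1024-d in total
        h = self.fc(z)                                # (B, 8*8*512)
        return h.view(-1, 512, 8, 8)                  # input to the four 5x5 deconvolutions
```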

Because the input face image is partially occluded, there can be an obvious boundary between the occluded and non-occluded areas in the output. To address this problem, total variation (TV) losses are defined for both the coarse network and the refinement network as follows:

$$ L_{tv\_c} = \frac{1}{n}\sum\nolimits_{i,j = 1}^{n} {\left( {\left\| {x_{{coar_{i,j} }} - x_{{coar_{i,j - 1} }} } \right\|_{2}^{2} + \left\| {x_{{coar_{i,j} }} - x_{{coar_{i - 1,j} }} } \right\|_{2}^{2} } \right)} , $$
(3)
$$ L_{tv\_f} = \frac{1}{n}\sum\nolimits_{i,j = 1}^{n} {\left( {\left\| {x_{{fine_{i,j} }} - x_{{fine_{i,j - 1} }} } \right\|_{2}^{2} + \left\| {x_{{fine_{i,j} }} - x_{{fine_{i - 1,j} }} } \right\|_{2}^{2} } \right)} . $$
(4)
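The TV losses above can be computed directly from neighbouring-pixel differences; a small sketch, assuming PyTorch tensors of shape (B, C, H, W) and mean rather than sum normalization, is:

```python
import torch

def tv_loss(x):
    # Squared differences between horizontal neighbours: x[i, j] - x[i, j-1].
    dh = x[:, :, :, 1:] - x[:, :, :, :-1]
    # Squared differences between vertical neighbours: x[i, j] - x[i-1, j].
    dv = x[:, :, 1:, :] - x[:, :, :-1, :]
    return dh.pow(2).mean() + dv.pow(2).mean()

# L_tv_c = tv_loss(x_coar); L_tv_f = tv_loss(x_fine), as in Eqs. (3) and (4).
```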

Discriminator Network:

The discriminator network consists of five 5 × 5 convolutional layers with stride 2 and one fully connected layer. The following GAN losses are utilized to train the generator and discriminator networks:

$$ L_{D\_c} = E_{{x_{r} \sim P_{r} }} \left[ {\log D\left( {x_{r} } \right)} \right] + E_{{x_{coar} \sim P_{coar} }} \left[ {\log \left( {1 - D\left( {x_{coar} } \right)} \right)} \right], $$
(5)
$$ L_{D\_f} = E_{{x_{r} \sim P_{r} }} \left[ {\log D\left( {x_{r} } \right)} \right] + E_{{x_{fine} \sim P_{fine} }} \left[ {\log \left( {1 - D\left( {x_{fine} } \right)} \right)} \right], $$
(6)

where \( D \) is the discriminator network utilized to distinguish the restored face image from the ground-truth face image. In order to generate better face images and accelerate the convergence of the face inpainting generative model, a feature map loss \( L_{f\_D} \) is defined as follows:

$$ L_{f\_D} = \frac{1}{2}\left\| {f_{D} \left( {x_{r} } \right) - f_{D} \left( {x_{coar} } \right)} \right\|_{2}^{2} + \frac{1}{2}\left\| {f_{D} \left( {x_{r} } \right) - f_{D} \left( {x_{fine} } \right)} \right\|_{2}^{2} , $$
(7)

where \( f_{D} \) is the output of the last fully connected layer of the discriminator network.
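A possible implementation of the discriminator and of the losses in Eqs. (5)-(7) is sketched below; the channel widths, the LeakyReLU activations, and the use of the flattened last convolutional feature as \( f_{D} \) are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512]
        # Five 5x5 stride-2 convolutions: 128 -> 64 -> 32 -> 16 -> 8 -> 4.
        self.convs = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 5, stride=2, padding=2)
            for i in range(5))
        self.fc = nn.Linear(512 * 4 * 4, 1)

    def features(self, x):
        h = x
        for conv in self.convs:
            h = F.leaky_relu(conv(h), 0.2)
        return h.flatten(1)                 # used as f_D(x) below

    def forward(self, x):
        return torch.sigmoid(self.fc(self.features(x)))

def adversarial_loss(D, x_r, x_gen, eps=1e-8):
    # Eqs. (5)/(6): log D(x_r) + log(1 - D(x_gen)).
    return (torch.log(D(x_r) + eps) + torch.log(1 - D(x_gen) + eps)).mean()

def feature_map_loss(D, x_r, x_coar, x_fine):
    # Eq. (7), in mean-squared form.
    f_r = D.features(x_r)
    return 0.5 * F.mse_loss(D.features(x_coar), f_r) + \
           0.5 * F.mse_loss(D.features(x_fine), f_r)
```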

AU Classification Network:

The AU classification network consists of four 5 × 5 convolutional layers with stride 2 and one fully connected layer. Since AU detection is a multi-label problem, the loss for pre-training the AU classifier is introduced in Sect. 3.3. To simplify the loss here, we treat the AU knowledge as the ground-truth labels and the recognized AU results as the generated ones. A single simple discriminator is utilized to distinguish the generated labels from the ground-truth, and the losses are defined as:

$$ L_{D_{au\_c}}=E_{l_{r}\sim P_{r}} \left [\log D_{au}\left(l_{r} \right ) \right ] + E_{l_{coar}\sim P_{coar}} \left [\log \left( 1- D_{au}\left(l_{coar} \right ) \right ) \right ], $$
(8)
$$ L_{D_{au\_f}}=E_{l_{r}\sim P_{r}} \left [\log D_{au}\left(l_{r} \right ) \right ] + E_{l_{fine}\sim P_{fine}} \left [\log \left( 1- D_{au}\left(l_{fine} \right ) \right ) \right ], $$
(9)

Also, in order to generate better face images and accelerate the convergence of the face inpainting framework, the feature map loss is defined as follows:

$$ L_{f\_c} = \frac{1}{2}\left\| {f_{C} \left( {x_{r} } \right) - f_{C} \left( {x_{coar} } \right)} \right\|_{2}^{2} + \frac{1}{2}\left\| {f_{C} \left( {x_{r} } \right) - f_{C} \left( {x_{fine} } \right)} \right\|_{2}^{2} , $$
(10)

where \( f_{C} \) is the output of the last fully connected layer of the AU classification network.
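The AU-level adversarial losses of Eqs. (8) and (9) operate on AU label/probability vectors rather than images; a minimal sketch of such an AU discriminator is given below, where the MLP layout and the number of AUs (12, as in BP4D/DISFA) are assumptions.

```python
import torch
import torch.nn as nn

class AUDiscriminator(nn.Module):
    def __init__(self, num_aus=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_aus, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, au_vec):
        return self.net(au_vec)

def au_adversarial_loss(D_au, l_r, l_gen, eps=1e-8):
    # Eqs. (8)/(9): l_r is sampled from the AU knowledge, l_gen is the AU
    # probability vector detected on the coarse or fine output.
    return (torch.log(D_au(l_r) + eps) + torch.log(1 - D_au(l_gen) + eps)).mean()
```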

The total loss of our proposed face inpainting network is:

$$ \begin{aligned} L_{G} = & L_{pixel\_c} + L_{pixel\_f} + \lambda_{1} L_{tv\_c} + \lambda_{1} L_{tv\_f} + \lambda_{2} L_{D\_c} + \lambda_{2} L_{D\_f} \\ & + \lambda_{3} L_{f\_D} + \lambda_{2} L_{D_{au\_c}} + \lambda_{2} L_{D_{au\_f}} + \lambda_{3} L_{f\_c} , \\ \end{aligned} $$
(11)

where \( \lambda_{1} \), \( \lambda_{2} \) and \( \lambda_{3} \) are trade-off parameters.
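Putting the terms together, the generator objective of Eq. (11) is simply a weighted sum; a sketch using the loss functions defined above (the names are illustrative) is:

```python
def total_generator_loss(losses, lam1, lam2, lam3):
    # `losses` is a dict holding the individual loss tensors named after Eq. (11).
    return (losses["pixel_c"] + losses["pixel_f"]
            + lam1 * (losses["tv_c"] + losses["tv_f"])
            + lam2 * (losses["D_c"] + losses["D_f"])
            + lam3 * losses["f_D"]
            + lam2 * (losses["D_au_c"] + losses["D_au_f"])
            + lam3 * losses["f_c"])
```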

3.2 AU Knowledge

Since facial structure is relatively fixed, a network that considers high-level facial structural information should be carefully designed. AU knowledge is statistically obtained from a specific dataset. For each AU, its occurrence is represented by 1 and its absence by 0. Assuming that most AUs are located in two facial regions, the eye region and the mouth region, the status of the AUs around the eye region can be utilized to predict the status of the AUs around the mouth region, and vice versa. Thanks to the co-existence of AUs in the whole face image, take the situation where the mouth region is masked and the eye region is unmasked as an example: first we observe the AUs around the eye region. Supposing there are \( n \) AUs related to the eye region, there are \( 2^{n} \) possible configurations of these AUs. For each configuration, the conditional probability of each AU around the mouth region can be calculated on a specific dataset, as sketched below. All these conditional AU distributions are regarded as the AU knowledge. The mouth-to-eye direction is handled in the same way as the eye-to-mouth direction.
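As a concrete illustration, the AU knowledge could be estimated from the binary AU annotations of the training set as follows; the column indices of the eye-region and mouth-region AUs are placeholders and depend on the dataset.

```python
import numpy as np

def au_knowledge(labels, eye_idx, mouth_idx):
    # labels: (num_frames, num_aus) binary matrix of AU occurrences.
    # eye_idx / mouth_idx: lists of AU column indices for the two regions.
    n = len(eye_idx)
    table = {}
    for case in range(2 ** n):
        pattern = np.array([(case >> k) & 1 for k in range(n)])
        match = np.all(labels[:, eye_idx] == pattern, axis=1)
        if not match.any():
            table[tuple(pattern)] = None            # configuration unseen in the data
            continue
        # P(mouth-region AU occurs | eye-region configuration)
        table[tuple(pattern)] = labels[match][:, mouth_idx].mean(axis=0)
    return table
```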

3.3 Pre-training of AU Classifier

To obtain accurate AU information, a network pre-trained on the training set of each benchmark is utilized as the AU classifier in Fig. 1. The AU classifier is adopted to detect AUs from the whole input face image, the coarse generated face image, and the fine generated face image, and it helps the generator learn the facial AU distributions. The loss for training the AU classifier is defined as:

$$ L_{c}= E_{l_{r}\sim P_{r}} \left [\log D_{au} \left(l_{r} \right ) \right ] + E_{x_{r}\sim P_{r}} \left [\log \left ( 1- D_{au} \left ( C \left(x_{r} \right ) \right ) \right ) \right ] , $$
(12)

where \( C \) is the pre-trained AU classifier, which recognizes AUs and predicts the AU information in the masked region with the help of the AU knowledge.

4 Experiment

4.1 Datasets and Settings

Datasets:

Our proposed face inpainting network is evaluated on two datasets commonly used for facial expression analysis: BP4D [20] and DISFA [21]. For BP4D, we split the dataset into training and test sets by subject: 28 subjects in the training set and 13 subjects in the test set. Each frame is annotated with 12 AUs, each labeled as occurring or not. A total of 100,760 frames are used for training and 45,809 frames for testing. In BP4D, the differences in color and background among face images are small. DISFA is processed in the same way as BP4D, with 18 subjects in the training set and 9 subjects in the test set. Each frame is annotated with 12 AUs labeled as 0 or 1. A total of 87,209 frames are used for training and 43,605 frames for testing. Note that the color and background of the face images vary considerably in the DISFA dataset, which makes it difficult to obtain good results.

Pre-processing:

For each facial image, we apply a similarity transformation (rotation, uniform scaling, and translation) to obtain a 128 × 128 × 3 color face image. This transformation is shape-preserving and does not change the expression. A binary mask of the same size as the input, covering the eye or mouth region, is then applied to generate the occluded input face image, as sketched below.
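A minimal sketch of the masking step follows; the rectangle coordinates for the eye and mouth regions are illustrative assumptions, not values reported here.

```python
import numpy as np

def apply_region_mask(face, region="mouth"):
    # face: (128, 128, 3) aligned color face image.
    mask = np.ones((128, 128, 1), dtype=face.dtype)
    # Assumed rectangles roughly covering the mouth or eye region.
    top, bottom, left, right = (80, 120, 32, 96) if region == "mouth" else (40, 72, 20, 108)
    mask[top:bottom, left:right, :] = 0
    return face * mask, mask  # occluded input and the binary mask
```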

Implementation Details:

The AU classifier is pre-trained first; then the generator is trained every epoch and the discriminator every three epochs. The pre-trained AU classifier achieves detection performance slightly below the state of the art. The trade-off parameters are set to \( \lambda_{1} = 0.0 \), \( \lambda_{2} = 0.05 \), and \( \lambda_{3} = 1.0 \). We use Adam for optimization, with a learning rate of 0.0001 and a batch size of 16. After the tanh layer, Poisson blending is adopted as a post-processing step.
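One possible reading of this schedule is that the generator is updated every epoch while the discriminators are updated only every third epoch. A schematic loop under this assumption (the update callables are assumed to implement Eqs. (5), (6), (8), (9) and (11)) is:

```python
import torch

def make_optimizers(generator, discriminator, au_discriminator):
    # Adam with learning rate 1e-4, as stated above; batch size 16 is set in the data loader.
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(
        list(discriminator.parameters()) + list(au_discriminator.parameters()), lr=1e-4)
    return opt_g, opt_d

def train(loader, num_epochs, generator_step, discriminator_step):
    # generator_step(batch): one generator update minimizing Eq. (11).
    # discriminator_step(batch): one update of the image and AU discriminators.
    for epoch in range(num_epochs):
        for batch in loader:
            generator_step(batch)
        if (epoch + 1) % 3 == 0:            # discriminators trained every three epochs
            for batch in loader:
                discriminator_step(batch)
```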

4.2 Qualitative Results

We aim to use high-level facial AU information during face inpainting. Facial action units represent dynamic facial actions, which are useful for face-related tasks. To verify the effectiveness of facial action unit information in our face inpainting network, we compare the visual results with the baseline and baseline + C. The baseline is a simple auto-encoder network that generates face images without any facial AU information, and baseline + C is the baseline network with the AU classifier network to detect AUs. Coarse face images are generated by the coarse network and fine face images by the refinement network. The comparison results on the BP4D and DISFA datasets are shown in Figs. 3 and 4, respectively. It can be observed that the results of our proposed method are better than the others; for example, in the first row of Fig. 3, our fine result shows open eyes, consistent with the ground-truth, whereas the baseline produces closed eyes. In addition, we obtain good-quality face images on the DISFA dataset, as shown in Fig. 4. More differences can be observed in Figs. 3 and 4.

Fig. 3.

Results on the BP4D test set; the top three rows show comparisons for the eye-masked region and the remaining rows for the mouth-masked region. Zoom in for a better view of the details.

Fig. 4.

Results on the DISFA test set; the top three rows show comparisons for the eye-masked region and the remaining rows for the mouth-masked region. Zoom in for a better view of the details.

4.3 Quantitative Results

For face generation tasks, the structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) scores are usually used to evaluate image quality. The quantitative results of our proposed method are reported in Table 1. The fine results achieve the best metrics, outperforming the baseline by about 0.024 and 0.031 in SSIM and by about 0.92 and 1.79 in PSNR on BP4D and DISFA, respectively. Higher SSIM and PSNR correspond to higher quality of the generated face images. These quantitative results demonstrate the effectiveness of our proposed method on the face inpainting task with dynamic AU information. Note that the improvements on the DISFA dataset are limited due to its extremely unbalanced distribution.

Table 1. SSIM and PSNR on BP4D and DISFA
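For reference, SSIM and PSNR can be computed with scikit-image as in the sketch below, assuming uint8 RGB arrays and a scikit-image version that supports channel_axis; the evaluation implementation used here is not otherwise specified.

```python
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_image(restored, ground_truth):
    # restored / ground_truth: (H, W, 3) uint8 arrays of the same shape.
    ssim = structural_similarity(ground_truth, restored, channel_axis=-1)
    psnr = peak_signal_noise_ratio(ground_truth, restored)
    return ssim, psnr
```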

Many generative works produce face images that fail to retain identity information. The landmark distance can be used to measure whether the identity of the generated face image matches that of the ground-truth. Since facial landmarks reflect both facial actions and identity, we compare landmark distances in Table 2. The fine results achieve improvements of 4.131 and 9.694 on BP4D and DISFA respectively, which demonstrates the effectiveness of our proposed method in preserving facial identity.

Table 2. Landmarks distance on BP4D and DISFA
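One natural way to compute the landmark distance is the mean Euclidean distance between corresponding landmarks detected on the generated image and on the ground-truth image; the sketch below assumes an external landmark detector and this mean-distance definition, which is not spelled out above.

```python
import numpy as np

def landmark_distance(lm_generated, lm_ground_truth):
    # lm_*: (num_landmarks, 2) arrays of (x, y) coordinates from the same detector.
    return np.linalg.norm(lm_generated - lm_ground_truth, axis=1).mean()
```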

The AU detection results on the inpainted faces of different methods are shown in Table 3. The fine results outperform those of the baseline on the BP4D dataset, bringing significant increases of 4.3% and 1.0% in average accuracy for the eye-masked and mouth-masked settings over the baseline, and increases of 1.4% and 0.9% over baseline + C. On the DISFA dataset, the fine results bring significant increases of 4.3% and 3.6% in average accuracy for the eye-masked and mouth-masked settings over the baseline, and increases of 0.9% and 3.4% over baseline + C. All these results demonstrate the effectiveness of our proposed model in restoring facial expressions during face inpainting.

Table 3. AU detection performance of inpainting results on BP4D and DISFA

5 Conclusion

In this paper, we have proposed a novel face inpainting method that uses the dynamic structural information of facial action units, i.e., AU knowledge. The proposed method is beneficial to facial expression analysis under occlusion. Extensive qualitative and quantitative evaluations conducted on BP4D and DISFA have demonstrated the effectiveness of our method for face inpainting. The proposed framework is also promising for other face restoration tasks and multi-task problems, e.g., face recognition and facial attribute analysis. The generator can also be used to restore face images first, followed by facial expression analysis on the restored images.