Abstract
In recent years, deep learning based face inpainting methods have achieved promising results, mainly based on generative adversarial networks (GAN). Trained on a large number of samples, a GAN learns to generate realistic fake images from known training samples, but its trained parameters cannot be used to generate images that differ from the training samples. In addition, most previous works do not take into account high-level facial structure information, such as the co-existence of facial action units. To better exploit facial structural knowledge, this paper proposes a method that combines GAN with prior knowledge based on the high-level dynamic structural information of facial action units to complete face images. We primarily validate the effectiveness of our approach for facial expression restoration during face inpainting on two datasets, BP4D and DISFA.
1 Introduction
Face inpainting aims to restore missing facial regions and is attracting increasing attention in the field of face recognition, especially with the emergence of convolutional neural networks (CNN) [1] and generative adversarial networks (GAN) [2]. Previous works fall into two branches: conventional methods and deep learning based methods. Conventional methods often compute the similarity between facial patches and replace the missing patch with the most similar one found by nearest-neighbor search [3, 4]. Deep learning based methods are usually end-to-end models that complete the face image, but in most cases they cannot retain the facial attributes or actions of the ground-truth face image [5, 6]. Although deep learning based methods have achieved great success, the generated face images often contain distortions and visible artifacts. Some works also use low-level facial structure information, such as landmarks and face parsing [13, 15], which is easily captured from the content of face images, to generate the whole face. However, these methods cannot guarantee that the facial actions match the ground truth. To better learn the relation between occluded and non-occluded face regions, we exploit the co-existence of facial action units (AUs) to help complete face images.
Face inpainting benefits the study of facial emotion analysis, which is easily affected by challenging conditions such as low resolution and occlusion. In the field of facial expression analysis, facial action units (AUs) are a unique set of basic facial muscle actions at certain facial locations, defined by the Facial Action Coding System (FACS) [7], one of the most comprehensive and objective systems for describing facial expressions. Since facial structure and patterns of facial expression are relatively fixed, taking the relations between different AUs into account should benefit face inpainting under occlusion. However, face inpainting studies that explore the relations between different facial regions under different facial expressions are rare in the previous literature.
In this paper, we propose a novel face inpainting framework that exploits the correlations between occluded and non-occluded facial regions. AU knowledge is obtained by statistical methods. Based on this AU knowledge, we propose an AU discriminator that helps the classifier to produce the distribution of AUs in a dataset. The AU information of the coarse face image is concatenated in the refinement network. This weakly supervised design helps the generator learn face images under different AU distributions. The more accurate the AU results, the better the face image generated by the refinement network retains the details of the ground truth. The contributions of this paper are three-fold. First, a novel coarse-to-fine face inpainting network is proposed that exploits the dynamic structural information of facial action units. Second, the weakly supervised design with the help of an AU discriminator is efficient for face inpainting with different facial expressions. Third, to the best of our knowledge, this is the first work to combine AU knowledge with a generative model to recover facial expression information during inpainting.
2 Related Work
Our proposed framework is closely related to generative adversarial networks, existing face inpainting methods, and facial action unit detection methods.
2.1 Generative Adversarial Networks
As one of the most popular and significant image generation techniques, GAN [2] achieves promising results by using a min-max adversarial game to train a generator and a discriminator alternately. Thanks to adversarial training, the generator produces realistic face images. Recently, extended versions of GAN have achieved great success in face inpainting [5, 8], super-resolution [9, 10], style transfer [11] and face rotation [12]. Motivated by these successful solutions, we develop our face inpainting method based on GAN and an AU discriminator.
2.2 Face Inpainting
Human face inpainting is much more challenging than general image inpainting because facial structure is relatively fixed but contains large appearance variations. In addition, preserving face identity information is very important during face inpainting. Li et al. proposed to complete faces with the help of face parsing results [13]. Zhang et al. developed models to complete face images corrupted by wavy lines [14]. Both face parsing and facial landmarks were utilized to supervise model learning in [15]. However, with such low-level geometric information it is difficult to recover facial details that respect the semantic relations between different facial regions. Taking advantage of the co-existence of different AUs under different facial expressions to restore the actions of the filled region is still an open challenge, which motivates us to complete face images with high-level AU information.
2.3 Facial Action Units Detector
Automatic facial AU detection plays an important role in describing facial actions. To recognize facial action units in complex environments, many works have explored various features and classifiers. With the development of deep learning, recent works exploit deep representations to recognize AUs, such as [16]. Exploiting the correlation between different AUs, [17] attends to the regions centered at facial landmarks, which is efficient for model learning, and [18] proposes a region layer to help learn deep representations. In [19], AU knowledge makes it possible to train the model in a weakly supervised way. These methods achieve good performance for AU detection, which motivates us to exploit the correlation between occluded and non-occluded facial regions.
3 Our Method
In this section, an overview of the proposed model, which exploits prior knowledge of facial action units to complete face images, is first given in Sect. 3.1. Then, the AU knowledge is presented in Sect. 3.2. Lastly, the pre-training of the AU classifier is explained in Sect. 3.3.
3.1 Proposed Method
In this section, the overall structure of our proposed model is introduced in three parts: the first is the face inpainting generator network; the second is the discriminator network, which is mainly used to distinguish the real original image from the fake repaired image; the third is an AU recognition network. It is worth noting that the discriminator network and the AU recognition network are not used at test time. An overview of our proposed face inpainting framework is shown in Fig. 1.
In the first part, a coarse-to-fine face inpainting generator consisting of a coarse network and a refinement network is proposed. Given a masked facial image, the coarse network, an auto-encoder, first generates a coarse face image. The refinement network then takes the facial action unit information of the coarse face image as a condition and generates a refined facial image with the corresponding AU details. The input is a 128 × 128 × 3 face image with a mask and the output is a restored 128 × 128 × 3 face image.
Coarse Network:
The structure of the coarse network is shown in Fig. 2(a). Four 5 × 5 convolutional layers with stride 2 are used in the down-sampling network, and four 5 × 5 deconvolutional layers are used in the symmetric up-sampling network. The tanh activation function is used in the last deconvolutional (output) layer, and the ReLU activation function is used in the other layers. To obtain a better coarse result, a pixel-wise loss is adopted to supervise model learning, defined as:
where \( x_{r} \) is the ground-truth face image without a mask and \( x_{coar} \) is the coarse face image output by the coarse network.
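As a quick check of the encoder geometry described for the coarse network, the following minimal sketch computes the spatial size after each stride-2 convolution. The padding of 2 is an assumption (the text does not state it), and the function names are hypothetical.

```python
def conv_out_size(size, kernel=5, stride=2, padding=2):
    """Spatial output size of one strided convolution (floor convention).

    padding=2 is an assumed value; the paper only gives kernel and stride.
    """
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(size=128, layers=4):
    """Sizes of the feature maps after each of the down-sampling layers."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes


print(encoder_sizes())  # [128, 64, 32, 16, 8]
```

Under this assumption, four layers reduce a 128 × 128 input to 8 × 8, which matches the 8 × 8 × 512 tensor mentioned for the refinement network.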
Refinement Network:
The structure of the refinement network is shown in Fig. 2(b). Four 5 × 5 convolutional layers with stride 2 are used in the down-sampling network, followed by one fully convolutional layer. The condition information is the combination of the latent feature output by the encoder and a one-hot vector of the AU classification results. The 1024-dimensional concatenated vector is regarded as the bottleneck of the refinement network and is fed to the first fully convolutional layer of the up-sampling network, which outputs an 8 × 8 × 512 dimensional vector. This output is then reshaped into a tensor of shape (8, 8, 512). Finally, four 5 × 5 deconvolutional layers are used in the decoder network. As in the coarse network, the activation of the last layer is tanh, and a similar pixel-wise loss is defined as:
where \( x_{fine} \) is the fine face image output by the refinement network.
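The pixel-wise losses of the coarse and refinement networks compare a generated image with the ground truth. The sketch below is one possible form, assuming a mean absolute (L1) difference; the norm actually used is not stated in the text, and the function name is hypothetical.

```python
def pixel_loss(x_ref, x_gen):
    """Mean absolute difference between two images given as flat lists.

    Assumption: an L1 penalty. x_ref plays the role of x_r and x_gen the
    role of x_coar or x_fine.
    """
    if len(x_ref) != len(x_gen):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(x_ref, x_gen)) / len(x_ref)


# Toy example with two pixels.
print(pixel_loss([0.0, 1.0], [0.5, 0.5]))  # 0.5
```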
Because the input face image is partially occluded, an obvious boundary appears between the occluded and non-occluded areas in the output. To solve this problem, a total variation (TV) loss is defined for both the coarse network and the refinement network as follows:
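A total variation penalty discourages abrupt changes between neighboring pixels, which is why it can soften the seam around the filled region. The following is a minimal sketch of the common anisotropic form on a single-channel image; whether the paper uses this exact variant is an assumption, and the function name is hypothetical.

```python
def tv_loss(img):
    """Anisotropic total variation of a 2D image given as a list of rows.

    Sums absolute differences between vertically and horizontally adjacent
    pixels. This is a standard TV form, assumed here for illustration.
    """
    h, w = len(img), len(img[0])
    vertical = sum(abs(img[i + 1][j] - img[i][j])
                   for i in range(h - 1) for j in range(w))
    horizontal = sum(abs(img[i][j + 1] - img[i][j])
                     for i in range(h) for j in range(w - 1))
    return vertical + horizontal


flat = [[1, 1], [1, 1]]   # no boundary
edge = [[0, 0], [1, 1]]   # sharp horizontal boundary
print(tv_loss(flat), tv_loss(edge))  # 0 2
```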
Discriminator Network:
The discriminator network consists of five 5 × 5 convolutional layers with stride 2 and one fully convolutional layer. The following GAN loss is used to train the generator network and the discriminator network:
where \( D \) is the discriminator network used to distinguish the restored face image from the ground-truth face image. To generate better face images and speed up the convergence of the face inpainting generative model, a feature map loss \( L_{f\_D} \) is defined as follows:
where \( f_{D} \) is the output of the last fully convolutional layer of the discriminator network.
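To make the adversarial and feature map terms concrete, the sketch below evaluates them for scalar discriminator outputs and small feature vectors. It assumes the standard GAN objective of [2] with a non-saturating generator term, and a mean squared difference for the feature map loss; both choices and all function names are assumptions rather than the paper's stated formulas.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negative log-likelihood of classifying real as 1 and fake as 0.

    d_real and d_fake are discriminator probabilities in [0, 1].
    """
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_adv_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss (an assumed variant)."""
    return -math.log(d_fake + eps)


def feature_map_loss(f_real, f_fake):
    """Mean squared difference between feature vectors (assumed form).

    f_real and f_fake stand for f_D evaluated on the ground-truth and the
    restored image.
    """
    return sum((a - b) ** 2 for a, b in zip(f_real, f_fake)) / len(f_real)


print(feature_map_loss([1.0, 2.0], [1.0, 4.0]))  # 2.0
```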
AU Classification Network:
The AU classification network consists of four 5 × 5 convolutional layers with stride 2 and one fully convolutional layer. Because AU detection is a multi-label problem, the loss of the AU classifier is introduced in Sect. 3.3. To simplify the loss here, we use the AU knowledge as the ground-truth labels and the recognized AU results as the generated ones. A single simple discriminator is used to distinguish the generated ones from the ground truth, and the losses are defined as:
Also, to generate better face images and speed up the convergence of the face inpainting framework, the feature map loss is defined as follows:
where \( f_{C} \) is the output of the last fully convolutional layer of the AU classification network.
The total loss of our proposed face inpainting network is:
where \( \lambda_{1} \), \( \lambda_{2} \) and \( \lambda_{3} \) are trade-off parameters.
3.2 AU Knowledge
Because facial structure is relatively fixed, a network that considers high-level facial structural information should be carefully designed. AU knowledge is obtained statistically from a specific dataset. For each AU, presence or absence is represented by 1 or 0. Assuming that most AUs are located in two facial regions, the eye region and the mouth region, the status of the AUs around the eye region can be used to predict the status of the AUs around the mouth region, and vice versa. Because AUs co-exist across the whole face, take as an example the case in which the mouth region is masked and the eye region is unmasked. First, we observe the AUs around the eye region. Supposing there are \( n \) AUs related to the eye region, there are \( 2^{n} \) possible cases for these \( n \) AUs. For each case, the conditional probability of each AU around the mouth region can be calculated on a specific dataset. All these conditional AU distributions are regarded as the AU knowledge here. The mouth-to-eye case is handled in the same way as the eye-to-mouth case.
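The statistical procedure above can be sketched as a lookup table from each observed eye-region AU pattern to the empirical conditional probability of each mouth-region AU. This is an illustrative sketch, not the authors' code: the function name, the data layout and the toy labels are hypothetical.

```python
from collections import defaultdict


def build_au_knowledge(eye_labels, mouth_labels):
    """Estimate P(mouth AU = 1 | eye AU pattern) from binary labels.

    eye_labels:   list of tuples of 0/1, one tuple of n eye AUs per frame
    mouth_labels: list of tuples of 0/1, one tuple of mouth AUs per frame
    Returns a dict mapping each observed eye pattern (at most 2**n keys)
    to a list of conditional probabilities, one per mouth AU.
    """
    counts = defaultdict(int)
    sums = {}
    for eye, mouth in zip(eye_labels, mouth_labels):
        key = tuple(eye)
        counts[key] += 1
        if key not in sums:
            sums[key] = [0] * len(mouth)
        for k, value in enumerate(mouth):
            sums[key][k] += value
    return {key: [s / counts[key] for s in sums[key]] for key in counts}


# Toy data: 2 eye AUs and 2 mouth AUs over 4 frames.
eye = [(1, 0), (1, 0), (0, 0), (1, 0)]
mouth = [(1, 1), (0, 1), (0, 0), (1, 0)]
knowledge = build_au_knowledge(eye, mouth)
print(knowledge[(0, 0)])  # [0.0, 0.0]
```

The mouth-to-eye table follows by swapping the two arguments. Eye patterns that never occur in the dataset simply have no entry in this sketch.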
3.3 Pre-training of AU Classifier
To obtain accurate AU information, a network pre-trained on the training set of each benchmark is used as the AU classifier in Fig. 1. The AU classifier is adopted to detect AUs from the whole input face image, the coarse generated face image and the fine generated face image, and it helps the generator to learn the facial AU distributions. The loss for training the AU classifier is defined as:
where \( C \) is the pre-trained AU classifier, which recognizes AUs in order to predict the AU information in the masked region with the help of the AU knowledge.
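Since AU detection is a multi-label problem, a common choice for such a classifier is an independent binary cross-entropy term per AU. The sketch below shows that form as an assumption; the paper's exact classifier loss is not reproduced here, and the function name is hypothetical.

```python
import math


def multilabel_bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over AUs (an assumed loss form).

    probs:  predicted probability of each AU being present
    labels: ground-truth 0/1 label of each AU
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


# An undecided classifier (p = 0.5 everywhere) costs log(2) per AU.
print(round(multilabel_bce([0.5, 0.5], [1, 0]), 4))  # 0.6931
```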
4 Experiment
4.1 Datasets and Settings
Datasets:
Our proposed face inpainting network is evaluated on two datasets commonly used for facial expression analysis: BP4D [20] and DISFA [21]. For BP4D, we split the dataset into training and test sets by subject, with 28 subjects in the training set and 13 subjects in the test set. Each set contains 12 AUs, each labeled as present or absent. A total of 100760 frames are used for training and 45809 frames for testing. In BP4D, the differences in color and background between face images are small. DISFA is processed in the same way as BP4D, with 18 subjects in the training set and 9 subjects in the test set. Each set contains 12 AUs labeled as 0 or 1. A total of 87209 frames are used for training and 43605 frames for testing. Note that the color and background of the face images vary greatly in DISFA, which makes it difficult to obtain good results.
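A subject-based split keeps all frames of one person on the same side, so that no identity appears in both training and test data. A minimal sketch follows; the subject IDs, the frame tuples and the function name are hypothetical and do not reflect the datasets' actual naming.

```python
def split_by_subject(frames, test_subjects):
    """Split (subject_id, frame_id) pairs so that subjects never overlap."""
    test_subjects = set(test_subjects)
    train = [f for f in frames if f[0] not in test_subjects]
    test = [f for f in frames if f[0] in test_subjects]
    return train, test


# Toy example with three hypothetical subjects.
frames = [("S01", 0), ("S01", 1), ("S02", 0), ("S03", 0)]
train, test = split_by_subject(frames, {"S03"})
print(len(train), len(test))  # 3 1
```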
Pre-processing:
For each facial image, we perform a similarity transformation, including rotation, uniform scaling and translation, to obtain a 128 × 128 × 3 color face image. This transformation is shape-preserving and does not change the expression. A mask covering the eye or mouth region, with the same size as the input, is then added to produce the corrupted input face image.
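Rotation, uniform scaling and translation together form a similarity transform, which scales every distance by the same factor and therefore keeps the face shape. The sketch below applies such a transform to 2D points and checks this property; the parameter values and function names are hypothetical.

```python
import math


def similarity_transform(point, angle_deg, scale, tx, ty):
    """Rotate by angle_deg, scale uniformly, then translate a 2D point."""
    x, y = point
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)


def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Two points 5 units apart end up 10 units apart under scale 2,
# regardless of the rotation angle and translation.
p, q = (0.0, 0.0), (3.0, 4.0)
p2 = similarity_transform(p, 30.0, 2.0, 5.0, -1.0)
q2 = similarity_transform(q, 30.0, 2.0, 5.0, -1.0)
print(round(distance(p2, q2), 6))  # 10.0
```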
Implementation Details:
The AU classifier is pre-trained first; then the generator is trained every epoch and the discriminator is trained every three epochs. The pre-trained AU classifier achieves good results, slightly below the state of the art. The trade-off parameters are set to \( \lambda_{1} = 0.0 \), \( \lambda_{2} = 0.05 \), and \( \lambda_{3} = 1.0 \). We use Adam for optimization with a learning rate of 0.0001 and a batch size of 16. After the tanh layer, Poisson blending is adopted as a post-processing step.
4.2 Qualitative Results
We aim to use high-level facial AU information during face inpainting. Facial action units represent dynamic facial actions, which are useful for face-related tasks. To verify the effectiveness of facial action unit information in our face inpainting network, we compare the visual results with baseline and baseline + C. Baseline is a simple autoencoder network that generates face images without any facial AU information, and baseline + C is the baseline network with an AU classifier network to detect AUs. Coarse face images are generated by the coarse network and fine face images by the refinement network. The comparison results on the BP4D and DISFA datasets are shown in Figs. 3 and 4 respectively. It can be observed that the results of our proposed method are better than the others. For example, in the first row of Fig. 3, our fine result shows opened eyes, the same as the ground truth, whereas the baseline shows closed eyes. In addition, we obtain good face image quality on the DISFA dataset, as shown in Fig. 4. More differences can be observed in Figs. 3 and 4.
4.3 Quantitative Results
For face generation tasks, structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) are usually used to evaluate image quality. The quantitative results of our proposed method are given in Table 1. The fine results achieve the best metrics, outperforming the baseline by about 0.024 and 0.031 in SSIM and by about 0.92 and 1.79 in PSNR on the BP4D and DISFA datasets respectively. Higher SSIM and PSNR indicate higher quality of the generated face images. These quantitative results demonstrate the effectiveness of our proposed method on the face inpainting task with dynamic AU information. Note that the improvement on the DISFA dataset is limited by its extremely imbalanced distribution.
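For reference, PSNR follows directly from the mean squared error between two images. The sketch below uses the standard definition for 8-bit pixel values; the flat-list image layout and the function name are simplifications chosen for illustration, not the evaluation code used in the paper.

```python
import math


def psnr(ref, gen, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, gen)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# Maximally different 8-bit images give 0 dB; identical images give inf.
print(psnr([0, 0], [255, 255]))  # 0.0
print(psnr([10, 20], [10, 20]))  # inf
```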
Most generative works produce face images that cannot retain identity information. The landmark distance can be used to measure whether the generated face image and the ground truth share the same identity. Since facial landmarks represent both facial actions and identity, we compare landmark distances in Table 2. The fine results achieve improvements of 4.131 and 9.694 on BP4D and DISFA respectively, which demonstrates the effectiveness of our proposed method in preserving facial identity.
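One simple way to compute a landmark distance is the mean Euclidean distance between corresponding landmarks of the generated and ground-truth faces. The sketch below assumes this definition, since the text does not specify the exact metric or the landmark detector; the function name is hypothetical.

```python
import math


def landmark_distance(pred, target):
    """Mean Euclidean distance between corresponding 2D landmarks.

    pred and target are equal-length lists of (x, y) tuples. Averaging
    over landmarks is an assumed convention.
    """
    if len(pred) != len(target):
        raise ValueError("landmark sets must have the same length")
    total = sum(math.hypot(px - tx, py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(pred)


print(landmark_distance([(0, 0), (1, 1)], [(3, 4), (1, 1)]))  # 2.5
```

A lower value means the generated landmarks lie closer to those of the ground truth.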
The AU detection results on the inpainted faces of different methods are shown in Table 3. On the BP4D dataset, the fine results outperform those of the baseline: they bring significant increments of 4.3% and 1.0% in average accuracy for eye-masked and mouth-masked inputs respectively over the baseline, and increments of 1.4% and 0.9% over baseline + C. On the DISFA dataset, the fine results bring significant increments of 4.3% and 3.6% in average accuracy for eye-masked and mouth-masked inputs respectively over the baseline, and increments of 0.9% and 3.4% over baseline + C. All these results demonstrate the effectiveness of our proposed model in facial expression restoration during face inpainting.
5 Conclusion
In this paper, we have proposed a novel face inpainting method that uses the dynamic structural information of facial action units in the form of AU knowledge. The proposed method benefits facial expression analysis under occlusion and similar conditions. Extensive qualitative and quantitative evaluations on BP4D and DISFA have demonstrated the effectiveness of our method for face inpainting. The proposed framework is also promising for other face restoration tasks and other multi-task problems, e.g. face recognition and facial attribute analysis. In addition, the generator can be used to restore face images first, after which facial expression analysis can be performed on the restored images.
References
1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
2. Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
3. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Trans. Graph. (TOG) 26(3), 4 (2007)
4. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in recognition. Prog. Brain Res. 155, 23–36 (2006)
5. Yeh, R.A., Chen, C., Yian Lim, T., et al.: Semantic image inpainting with deep generative models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5485–5493 (2017)
6. Pathak, D., Krahenbuhl, P., Donahue, J., et al.: Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
7. Ekman, R.: What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, USA (1997)
8. Yu, J., Lin, Z., Yang, J., et al.: Generative image inpainting with contextual attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514 (2018)
9. Ledig, C., Theis, L., Huszár, F., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690 (2017)
10. Chen, Y., Tai, Y., Liu, X., et al.: FSRNet: end-to-end learning face super-resolution with facial priors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2492–2501 (2018)
11. Chang, H., Lu, J., Yu, F., et al.: PairedCycleGAN: asymmetric style transfer for applying and removing makeup. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 40–48 (2018)
12. Huang, R., Zhang, S., Li, T., et al.: Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2439–2448 (2017)
13. Li, Y., Liu, S., Yang, J., et al.: Generative face completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3911–3919 (2017)
14. Zhang, S., He, R., Sun, Z., et al.: DeMeshNet: Blind Face Inpainting for Deep MeshFace Verification (2018)
15. Song, L., Cao, J., Song, L., et al.: Geometry-aware face completion and editing. arXiv preprint arXiv:1809.02967 (2018)
16. Han, S., Meng, Z., Khan, A.S., et al.: Incremental boosting convolutional neural network for facial action unit recognition. In: Advances in Neural Information Processing Systems, pp. 109–117 (2016)
17. Li, W., Abtahi, F., Zhu, Z., et al.: EAC-Net: deep nets with enhancing and cropping for facial action unit detection. IEEE Trans. Pattern Anal. Mach. Intell. 40(11), 2583–2596 (2018)
18. Zhao, K., Chu, W.S., Zhang, H.: Deep region and multi-label learning for facial action unit detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3391–3399 (2016)
19. Peng, G., Wang, S.: Weakly supervised facial action unit recognition through adversarial training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2188–2196 (2018)
20. Zhang, X., Yin, L., Cohn, J.F., et al.: A high-resolution spontaneous 3D dynamic facial expression database. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6. IEEE (2013)
21. Mavadati, S.M., Mahoor, M.H., Bartlett, K., et al.: DISFA: a spontaneous facial action intensity database. IEEE Trans. Affect. Comput. 4(2), 151–160 (2013)
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants 41806116 and 61503277.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, L., Liu, Z., Zhang, C. (2019). Face Inpainting with Dynamic Structural Information of Facial Action Units. In: Zhao, Y., Barnes, N., Chen, B., Westermann, R., Kong, X., Lin, C. (eds) Image and Graphics. ICIG 2019. Lecture Notes in Computer Science(), vol 11903. Springer, Cham. https://doi.org/10.1007/978-3-030-34113-8_55
DOI: https://doi.org/10.1007/978-3-030-34113-8_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34112-1
Online ISBN: 978-3-030-34113-8
eBook Packages: Computer Science, Computer Science (R0)