
Applied Soft Computing

Volume 111, November 2021, 107626

Face inpainting based on GAN by facial prediction and fusion as guidance information

https://doi.org/10.1016/j.asoc.2021.107626

Highlights

  • FIFPNet predicts and fuses face semantic information to complete face inpainting.

  • We use a discriminator instead of the KL loss in the VAE, which enhances the encoder's learning ability.

  • Guidance information can enhance the robustness of the generators in FIFPNet.

Abstract

Face inpainting, a special case of image inpainting, aims to complete occluded facial regions under unconstrained pose and orientation. However, existing methods generate unsatisfying results with easily detectable flaws: boundaries and details near the holes are often fuzzy. In particular, the face region semantic information (face structure, contour, and content information) has not been fully utilized, which leads to unnatural face images with artifacts such as asymmetric eyebrows and eyes of different sizes. This is unacceptable in many practical applications. To solve these problems, a new generative adversarial network that uses facial prediction and fusion as guidance information is proposed for face inpainting with large missing regions. The proposed method adopts two stages, covering coarse inpainting and refinement of the face. In Stage-I, we combine the generator with a new encoder–decoder network with a variational autoencoder-based backbone to predict the face region semantic information (including face structure, contour, and content information) and perform facial fusion for face inpainting. This fully explores face region semantic information and generates coordinated coarse face images. Stage-II builds upon the Stage-I results to refine the face image. Both global and patch discriminators are used to synthesize high-quality photo-realistic inpainting. Experimental results on both the CelebA and CelebA-HQ datasets demonstrate the effectiveness and efficiency of our method.

Introduction

Image inpainting aims to fill in the missing part of an image with visually plausible content. Recent advances in deep generative models have shown promising potential [1], [2], [3], [4], [5]. Barnes et al. [1] utilized the continuity of images to reduce the search scope of patch similarity and introduced a fast nearest-neighbor field algorithm called PatchMatch (PM) for image editing applications. Iizuka et al. [2] used both a local discriminator and a global discriminator (GLCIC) to assess the missing area and the overall image, respectively. Yu et al. [3] proposed two-stage generative image inpainting with contextual attention (CA) to refine the low-resolution result generated by the first stage. Zheng et al. [4] proposed a pluralistic image completion network (PICNet) with a reconstructive path and a generative path to create multiple plausible results. Zeng et al. [5] used a pyramid-context encoder network (PEN) to complete image inpainting. Liu et al. [6] designed coherent semantic attention (CSA) to complete image inpainting by modeling the similarity between the inpainted region and its neighborhood. Yu et al. [7] proposed region normalization (RN) to repair the missing areas. Based on these frameworks, there is a growing body of work on image inpainting, such as [8] and [9].

However, existing methods either suffer severe performance penalties or generate unsatisfying results with easily detectable flaws. Moreover, there is often perceivable discontinuity near the holes, which requires further post-processing to refine the results. In particular, for face inpainting, the face region semantic information (face structure, contour, and content information) has not been fully explored, which leads to unnatural face images with artifacts such as asymmetric eyebrows and eyes of different sizes. This is unacceptable in many practical applications. Unlike conventional image inpainting, face inpainting requires content, contour, and structure information about the target object for realistic outputs. General image inpainting methods do not consider these particularities of the human face and do not fully explore and utilize the information it contains, which leads to unnatural and blurry face images. Consequently, face inpainting remains a challenging problem.

Furthermore, only a few papers are dedicated to the face inpainting task [10], [11]. They generally incorporate simple face features into the generator for face completion. However, the face region semantic information has still not been fully explored and utilized, which again leads to unnatural face images, especially for corrupted face images with large holes. In this paper, we focus on inpainting faces with large missing regions, for which three key difficulties remain. First, for a corrupted face image with large holes, it is impractical to complete the missing facial region purely from the remaining facial regions. A large square mask is harder to complete than an irregular or small square mask, which is why some image inpainting methods only inpaint images cropped with irregular or small square masks. The main reason is that the receptive field of a convolution kernel is square, and the kernel cannot capture any valid information once it lies entirely inside a large square missing region, whereas for an irregular or small mask the kernel can still catch useful information from the background within its receptive field; the sketch after this paragraph makes this argument concrete. Second, the content of the missing area differs greatly from the content of the background area, so it is very difficult to generate a natural and harmonious face from the background image alone. For example, some image inpainting methods use attention to find blocks similar to the missing area in the background; however, matching each missing block of each image against the surrounding background blocks takes a long time to train and easily produces distorted facial features. Third, completing large missing face regions should focus on reconstructing facial parts with natural and harmonious features rather than merely on pixel-level restoration. Therefore, face inpainting remains challenging, as it requires generating semantically new pixels for the missing key components with consistency in structure and appearance. How to improve the adaptability of the inpainting network and the correctness of its results still requires further study.
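As a rough illustration of the receptive-field argument above, the following sketch (our own, with hypothetical layer counts; not from the paper) computes how far valid context can propagate through a stack of 3×3 convolutions, and how wide the "blind" core of a square hole therefore is:

```python
# Illustration of the receptive-field argument (hypothetical numbers, not
# from the paper): with stride-1 3x3 convolutions, the receptive field grows
# by 2 pixels per layer, so pixels deep inside a large square hole receive
# no valid background information at all.
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field (pixels) of a stack of stride-1 conv layers."""
    return 1 + num_layers * (kernel_size - 1)

hole = 128  # side length of a square mask, in pixels
for n in (4, 8, 16):
    rf = receptive_field(n)
    # Pixels farther than rf // 2 from the hole boundary see only the hole.
    blind = max(0, hole - 2 * (rf // 2))
    print(f"{n} layers: receptive field {rf}px, "
          f"blind core of a {hole}px hole is {blind}px wide")
```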

In this paper, for face inpainting with large missing regions, a generative adversarial network (GAN) with facial prediction and fusion as guidance information (FIFPNet) is proposed to generate high-resolution face images with photo-realistic (natural) details. It adopts two stages, decomposing the task into a prior face image generation process and high-resolution face inpainting. In Stage-I, we construct a new encoder–decoder network with a variational autoencoder (VAE)-based backbone to predict face semantic knowledge (including contour, structure, and content information) and perform information fusion for face inpainting. Specifically, from the cropped images we reconstruct the face landmark, face part, and face mask to obtain face structure, content, and contour information, and merge this information as a guide to complete the face inpainting. This fully explores face region semantic information and generates coordinated coarse face images. Based on the Stage-I result, Stage-II uses a generative adversarial network to refine the face image. Both global and patch discriminators are used to synthesize high-quality photo-realistic inpainting. A structural sketch of this two-stage pipeline is given below.
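To make the pipeline structure concrete, here is a minimal PyTorch sketch of the two stages as we read them; all module names, channel widths, and the concatenation-based fusion are our own illustrative assumptions, not the paper's released code:

```python
# Minimal structural sketch of the two-stage pipeline (illustrative only;
# module names, channel sizes, and fusion-by-concatenation are assumptions).
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class SemanticPredictor(nn.Module):
    """Stage-I prior network: an encoder and three decoders that predict the
    face landmark (structure), face part (content), and face mask (contour)
    from the corrupted input."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(3, 32, 2), conv_block(32, 64, 2))
        self.dec_landmark, self.dec_part, self.dec_mask = (
            self._decoder(), self._decoder(), self._decoder())

    def _decoder(self):
        return nn.Sequential(nn.Upsample(scale_factor=2), conv_block(64, 32),
                             nn.Upsample(scale_factor=2),
                             nn.Conv2d(32, 3, 3, 1, 1), nn.Tanh())

    def forward(self, corrupted):
        z = self.encoder(corrupted)
        return self.dec_landmark(z), self.dec_part(z), self.dec_mask(z)

class GuidedGenerator(nn.Module):
    """Coarse (Stage-I) or refinement (Stage-II) generator that fuses the
    predicted semantics with its image input by channel concatenation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(12, 64), conv_block(64, 64),
                                 nn.Conv2d(64, 3, 3, 1, 1), nn.Tanh())

    def forward(self, image, landmark, part, mask):
        return self.net(torch.cat([image, landmark, part, mask], dim=1))

# Forward pass: Stage-I produces a coarse face, Stage-II refines it; both
# global and patch discriminators (not shown) then judge the result.
predictor, coarse_g, refine_g = SemanticPredictor(), GuidedGenerator(), GuidedGenerator()
corrupted = torch.randn(1, 3, 256, 256)
landmark, part, mask = predictor(corrupted)
coarse = coarse_g(corrupted, landmark, part, mask)
refined = refine_g(coarse, landmark, part, mask)
```

The main contributions of this paper are summarized as follows: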

  • To fully explore the semantic information of the face, FIFPNet uses two encoders and three decoders to obtain face structure, content, and contour information from the face landmark, face part, and face mask, respectively, and completes the feature fusion. Meanwhile, we reconstruct the landmark to ensure a fusion relationship in which structural information is the leading factor and contour and content information are auxiliary, yielding natural and harmonious face completion.

  • A traditional VAE uses the Kullback–Leibler (KL) divergence term as the distribution constraint on the intermediate sampling, but the large magnitude of the KL divergence can greatly disturb the reconstruction loss. For the sampling in feature extraction, we therefore use a discriminator instead of the KL divergence as the constraint; it plays an adversarial game with the encoder and enhances the encoder's learning ability (see the sketch after this list).

  • Fusing the face region semantic information into both the Stage-I and Stage-II generators not only guides the face inpainting but also enhances the robustness of the generators.
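As referenced in the second contribution above, here is a minimal sketch of regularizing the latent code with a discriminator instead of a KL term (adversarial-autoencoder-style; the network sizes and loss form are our assumptions, not the paper's exact formulation):

```python
# Sketch: replace the VAE's KL penalty with a latent-code discriminator
# (sizes and loss form are illustrative assumptions).
import torch
import torch.nn as nn

latent_dim = 256
code_disc = nn.Sequential(nn.Linear(latent_dim, 128), nn.LeakyReLU(0.2),
                          nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def latent_adversarial_losses(z_encoded):
    """z_encoded: encoder outputs of shape (B, latent_dim). The discriminator
    pushes them toward the N(0, I) prior, replacing the KL term whose large
    magnitude can swamp the reconstruction loss."""
    z_prior = torch.randn_like(z_encoded)
    real = torch.ones(len(z_encoded), 1)
    fake = torch.zeros(len(z_encoded), 1)
    # Discriminator: prior samples are "real", encoder codes are "fake".
    d_loss = (bce(code_disc(z_prior), real) +
              bce(code_disc(z_encoded.detach()), fake))
    # Encoder: fool the discriminator so its codes look prior-distributed.
    e_loss = bce(code_disc(z_encoded), real)
    return d_loss, e_loss
```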

The remainder of this paper is organized as follows. Section 2 describes the current mainstream image inpainting and face inpainting methods. FIFPNet is introduced in Section 3. In Section 4, experimental results and an ablation study demonstrate the effectiveness and efficiency of FIFPNet. Finally, our conclusions are presented in Section 5.

Section snippets

Variational Auto-Encoder (VAE)

An autoencoder performs reconstruction: the encoder extracts data features and the decoder decodes them directly. Kingma et al. [12] assumed that the data follow a normal distribution in latent space, and proposed the Variational Auto-Encoder (VAE) to constrain the distribution of the latent feature vector z extracted by the encoder. Given a real sample xk, the VAE supposes there is a distribution p(z|xk) dedicated to xk, and further assumes that p(z|xk) follows a Gaussian distribution. Thus,
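The snippet is truncated at this point. For reference, the standard VAE formulation this setup leads to, in the usual notation (our reconstruction; the paper's exact continuation is not shown here), is:

```latex
% Gaussian posterior per sample x_k, with reparameterized sampling:
p(z \mid x_k) = \mathcal{N}\bigl(\mu_k, \sigma_k^{2} I\bigr), \qquad
z_k = \mu_k + \sigma_k \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).
% Training maximizes the evidence lower bound (ELBO):
\mathcal{L}(x_k) = \mathbb{E}_{q(z \mid x_k)}\left[\log p(x_k \mid z)\right]
                 - D_{\mathrm{KL}}\bigl(q(z \mid x_k) \,\|\, \mathcal{N}(0, I)\bigr).
```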

Proposed algorithm

Existing face inpainting methods usually ignore or fail to take full advantage of the face region semantic information (i.e., structural information). They only use this related information as loss terms to constrain the completed face image, which has weak influence on constructing facial structure and optimizing facial texture; for example, symmetry relations may disappear in these methods. Therefore, we consider fusing face structure, content, and contour
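The snippet truncates here. To make the contrast concrete, the following sketch (our illustration, with hypothetical function names) shows semantic information used only as a loss term versus fused directly into the generator input:

```python
# Contrast sketch (hypothetical names): loss-term guidance vs. input fusion.
import torch

def loss_only_guidance(pred, gt, landmark_net, recon_loss, w=10.0):
    # Prior methods: semantic information enters only as a loss term,
    # which constrains structure weakly and after the fact.
    return recon_loss(pred, gt) + w * torch.mean(
        (landmark_net(pred) - landmark_net(gt)) ** 2)

def input_fusion_guidance(generator, corrupted, landmark, part, mask):
    # FIFPNet-style: predicted semantics are concatenated into the
    # generator's input, so structure guides synthesis directly.
    return generator(torch.cat([corrupted, landmark, part, mask], dim=1))
```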

Experiment

Conclusion

In this paper, we proposed a new generative framework with a GAN-based backbone that predicts and fuses facial semantic information to guide the inpainting of large missing face regions. It consists of two stages. Stage-I embeds the face region semantic information into latent variables as guidance information for face inpainting, generating low-resolution coordinated face images. Stage-II combines the generator with both global and patch discriminators and yields clear and

CRediT authorship contribution statement

Xian Zhang: Conceptualization, Methodology, Software. Canghong Shi: Validation, Resources. Xin Wang: Writing - review & editing. Xi Wu: Writing - review & editing. Xiaojie Li: Writing - original draft, Funding acquisition. Jiancheng Lv: Formal analysis, Investigation. Imran Mumtaz: Formal analysis, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Sichuan Science and Technology Program, China (2019JDJQ0002, 2018GZ0184, and 2018RZ0072), the Key Project of Natural Science of Sichuan Provincial Education Department, China (17ZA0063), and the Natural Science Foundation for Young Scientists of Chengdu University of Information Technology, China (J201704).

References (59)

  • H. Liu, B. Jiang, Y. Xiao, C. Yang, Coherent semantic attention for image inpainting, in: IEEE International Conference...
  • T. Yu et al., Region normalization for image inpainting.
  • G. Liu, F.A. Reda, K.J. Shih, T.-C. Wang, A. Tao, B. Catanzaro, Image inpainting for irregular holes using partial...
  • J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T.S. Huang, Free-form image inpainting with gated convolution, in: Proceedings...
  • H. Liao et al., Face completion with semantic knowledge and collaborative adversarial learning.
  • Y. Li, S. Liu, J. Yang, M.-H. Yang, Generative face completion, in: Proceedings of the IEEE Conference on Computer...
  • D.P. Kingma et al., Auto-encoding variational Bayes (2013).
  • A. Razavi et al., Generating diverse high-fidelity images with VQ-VAE-2 (2019).
  • I. Higgins et al., beta-VAE: Learning basic visual concepts with a constrained variational framework (2016).
  • M. Lopez-Martin et al., Conditional variational autoencoder for prediction and feature recovery applied to intrusion detection in IoT, Sensors (2017).
  • I.J. Goodfellow et al., Generative adversarial nets.
  • J. Harms et al., Paired cycle-GAN-based image correction for quantitative cone-beam computed tomography, Med. Phys. (2019).
  • H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN: Text to photo-realistic image synthesis...
  • T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings...
  • B.Z. Demiray et al., D-SRGAN: DEM super-resolution with generative adversarial networks, SN Comput. Sci. (2021).
  • J. Wenjie et al., Research on super-resolution reconstruction algorithm of remote sensing image based on generative adversarial networks.
  • X. Wu et al., A survey of image synthesis and editing with generative adversarial networks, Tsinghua Sci. Technol. (2017).
  • J. Lin et al., Anycost GANs for interactive image synthesis and editing (2021).
  • J. Zhang et al., PISE: Person image synthesis and editing with decoupled GAN (2021).