Multi-scale gradient attention guidance and adaptive style fusion for image inpainting

https://doi.org/10.1016/j.jvcir.2022.103681

Abstract

Image inpainting aims to fill in the missing regions of damaged images with plausible content. Existing inpainting methods tend to produce ambiguous artifacts and implausible structures. To address these issues, our method fully utilizes the information in the known regions to provide style and structural guidance for the missing regions. Specifically, the Adaptive Style Fusion (ASF) module reduces artifacts by transferring visual style features from known regions to missing regions. The Gradient Attention Guidance (GAG) module generates accurate structures by aggregating semantic information along gradient boundary regions. In addition, the Multi-scale Attentional Feature Extraction (MAFE) module extracts global contextual information and enhances the representation of image features. Extensive experimental results on three datasets demonstrate that the proposed method outperforms state-of-the-art inpainting methods in terms of visual plausibility and structural consistency.

Introduction

Image inpainting aims to generate visually credible content and structures for damaged images, and it is widely applied in image processing tasks such as image restoration [1], face editing [2], and object removal [3], [4]. Although breakthrough progress has been made in image inpainting, generating a consistent visual style and clear textural details remains a great challenge.

Existing image inpainting methods can be divided into two categories: traditional methods and learning-based methods. Traditional inpainting methods typically use local texture similarity to perform pixel interpolation and include diffusion-based [5], [6], [7] and patch-based [8], [9], [10], [11] methods. However, traditional methods lack a high-level understanding of image content and structure, so it is difficult for them to generate meaningful semantic content. Learning-based methods use convolutional neural networks to generate high-level semantic features to fill in missing content. However, they often produce distorted structures and blurred textures when faced with large missing regions or complex contexts.

As learning-based inpainting methods were studied more deeply, researchers found that regular convolution cannot distinguish missing foreground pixels from known background pixels, which leads to drift in the mean and variance of features during normalization. To solve this problem, partial convolution (Pconv) [12] was proposed to replace regular convolution, convolving only the background pixels. To address the limitations of partial convolution, Deepfillv2 [13] applied gated convolution with learnable channel-wise soft masks so that different pixels are weighted adaptively, improving inpainting performance. However, these methods still cannot fundamentally solve the mean and variance shift during normalization. Regional Normalization (RN) [14] normalizes the missing foreground region and the background region separately, replacing instance normalization (IN) [15] in convolutional networks. However, since the missing foreground and background regions are processed independently, the semantic information of the background cannot guide the construction of the missing regions. As a result, no visual style characteristics are transferred from the background to the foreground region, and the transition across the boundary between the two regions is not smooth.
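
To make the region-wise statistics concrete, the following is a minimal sketch of the idea behind Region Normalization, assuming a PyTorch-style (N, C, H, W) layout; `region_normalize` is an illustrative helper, not the RN authors' implementation:

```python
import torch

def region_normalize(feat, mask, eps=1e-5):
    """Sketch of region-wise normalization in the spirit of RN [14]:
    the known (background) and missing (foreground) regions are each
    standardized with their own per-channel statistics.

    feat: (N, C, H, W) feature map
    mask: (N, 1, H, W), 1 for known pixels, 0 for missing pixels
    """
    out = torch.zeros_like(feat)
    for region in (mask, 1 - mask):  # known region, then missing region
        area = region.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean = (feat * region).sum(dim=(2, 3), keepdim=True) / area
        var = ((feat - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / area
        out = out + region * (feat - mean) / torch.sqrt(var + eps)
    return out
```

Because each region is standardized with its own mean and variance, known pixels no longer drag the statistics of the missing region, which is exactly the shift problem described above.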

Recently, structural information has been increasingly studied as guidance for image inpainting and has played an important role in generating reasonably realistic content for missing regions; such guidance includes edges [16], [17], saliency maps [18], [19], and gradient maps [20]. The key is to make full use of relatively accurate and complete structures to guide the network in generating the missing content. Edge Connect (EC) [16] proposed a two-stage model consisting of an edge generator and an image generator, with the relatively complete edge map predicted in the first stage serving as a condition for the second stage. Foreground-aware inpainting [18] embedded a saliency map to guide the generator to reconstruct plausible content. Compared with edge and saliency maps, the gradient map contains richer texture information. Zhang et al. [20] designed a Gradient Augmented Inpainting Network (GAIN) that uses an image gradient map instead of an edge map and proposed a dual-branch structure that completes the image and the gradient map separately. However, the above methods simply concatenate image features and structural information to guide the network during inference. As the network deepens and the structural information becomes sparser, the network weakens and forgets this information, making it difficult to fully exploit the structural guidance.
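
As an illustration of why a gradient map is a denser guidance signal than a binary edge map, the sketch below computes a Sobel gradient magnitude map of the kind a GAIN-style gradient branch could consume; the helper is hypothetical and not the GAIN code:

```python
import torch
import torch.nn.functional as F

def gradient_map(gray):
    """Illustrative gradient map for structural guidance.
    gray: (N, 1, H, W) grayscale image in [0, 1].
    """
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)  # Sobel x kernel
    ky = kx.transpose(2, 3)                              # Sobel y kernel
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)         # gradient magnitude
```

Unlike a thresholded edge map, this magnitude map keeps continuous texture variation, which is the "richer texture information" the paper attributes to gradients.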

During network inference and inpainting, the missing regions and the known regions lack the necessary visual style connection. The Adaptive Style Fusion Residual Block (ASF ResBlock) transmits statistical information internally and performs style transfer in the feature space, effectively combining the content of the missing regions with the style of the known regions and thereby significantly reducing artifacts. Furthermore, the Gradient Attention Guidance (GAG) module employs an attention mechanism to encourage information sharing between image feature maps and gradient feature maps, generating accurate semantic structures (a minimal sketch of this idea is given after the contribution list below). Finally, the Multi-scale Attentional Feature Extraction (MAFE) module adopts a spatial pyramid structure to capture and enhance deep features from the different receptive fields of the GAG modules. Under the interaction of these modules, the network alleviates artifacts and inaccurate structures. The main contributions of this paper are summarized as follows:

  • (1)

    The ASF module introduces the perspective of style transfer: it explicitly describes the style characteristics of the known regions and transfers them to the missing regions. Through adaptive fusion between known and missing regions, it effectively reduces artifacts and color differences.

  • (2)

    We propose a multi-scale gradient guidance framework that applies gradient attention to guide the network to generate trustworthy visual structures and textures at multiple scales of the decoder.

  • (3)

    Experiments on three public datasets, Paris StreetView [21], CelebA-HQ [22], and Places2 [23], qualitatively and quantitatively demonstrate the significant performance of the proposed method.
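
As promised above, here is a minimal sketch of the gradient-attention idea behind the GAG module as we describe it: a spatial attention map derived from gradient features re-weights the image features so that information is aggregated along gradient boundaries. The module below is our illustrative reading, not the released implementation:

```python
import torch
import torch.nn as nn

class GradientAttention(nn.Module):
    """Hypothetical sketch of gradient-guided attention: gradient features
    produce a per-pixel weight that emphasizes image features near boundaries.
    """
    def __init__(self, channels):
        super().__init__()
        self.to_attn = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),                       # per-pixel weight in (0, 1)
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, img_feat, grad_feat):
        attn = self.to_attn(grad_feat)          # (N, 1, H, W) attention map
        guided = img_feat * attn                # emphasize boundary regions
        return self.fuse(torch.cat([guided, img_feat], dim=1))
```

Keeping the un-weighted features in the concatenation lets the module share information between the two streams instead of discarding off-boundary content.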


Traditional inpainting

Traditional image inpainting methods extract low-level image features from the image to obtain plausible visual content for filling in missing regions. They are roughly divided into two categories: diffusion-based methods [5], [6], [7] and patch-based methods [8], [9], [10], [11]. Diffusion-based methods smoothly propagate image content from the known regions into the missing regions and synthesize new textures to fill the holes.
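
As a concrete toy example of the diffusion idea, the sketch below repeatedly averages each missing pixel with its neighbours, propagating known content inward; real diffusion-based methods [5], [6], [7] use more sophisticated (e.g., anisotropic or higher-order) PDEs:

```python
import numpy as np

def diffusion_inpaint(img, mask, iters=500):
    """Toy diffusion-based inpainting: each missing pixel is iteratively
    replaced by the mean of its 4-neighbours.
    img: (H, W) float array; mask: (H, W) bool, True where pixels are known.
    """
    out = img.copy()
    out[~mask] = 0.0                     # initialize the hole
    for _ in range(iters):
        avg = 0.25 * (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
                      np.roll(out, 1, 1) + np.roll(out, -1, 1))
        out[~mask] = avg[~mask]          # update only the missing pixels
    return out
```

This also shows the fundamental limitation noted above: pure diffusion produces smooth fills and cannot invent meaningful semantic content or texture.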

Method

We propose a new and effective inpainting algorithm, shown in Fig. 1, which adopts the image generator of Edge Connect [16] as the backbone network. This architecture contains an image encoder, followed by eight residual blocks and an image decoder. We first replace instance normalization with RN_B [14] in the encoder of the generator to ensure the independence of the foreground and background information. Then, eight ASF ResBlocks are applied to transfer the visual style from the foreground to the…
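
To make the style-transfer view of the ASF ResBlocks concrete, the following is a hedged sketch of an AdaIN-style statistic transfer between regions: missing-region features are re-standardized with the known region's per-channel mean and standard deviation. This is an assumption-laden illustration of the mechanism, not the authors' module:

```python
import torch

def adaptive_style_fusion(feat, mask, eps=1e-5):
    """Sketch of statistic transfer behind ASF (hypothetical).
    feat: (N, C, H, W) features; mask: (N, 1, H, W), 1 = known pixels.
    """
    known, missing = mask, 1 - mask

    def stats(region):
        area = region.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean = (feat * region).sum(dim=(2, 3), keepdim=True) / area
        var = ((feat - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / area
        return mean, torch.sqrt(var + eps)

    mk, sk = stats(known)                    # style statistics of known region
    mm, sm = stats(missing)
    styled = (feat - mm) / sm * sk + mk      # missing content, known style
    return known * feat + missing * styled   # fuse per region
```

The key design point is that only statistics cross the region boundary, so generated content keeps its structure while adopting the known region's color and tone, which is how the paper explains the reduction of artifacts.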

Datasets

We evaluate all methods on three public image datasets: Paris StreetView [21], CelebA-HQ [22], and Places2 [23]. Furthermore, in image inpainting we need to identify the location of the missing regions. Since irregular masks are more challenging and closer to real-world applications, the missing regions are simulated using the irregular mask dataset provided by Pconv [12]. The irregular mask dataset is divided into six intervals of mask ratio, i.e., 0–10%, 10–20%, 20–30%, 30–40%, 40–50%, and 50–60%.
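
For reproducing this protocol, a hypothetical helper such as the one below can bucket a mask by its hole ratio into the six evaluation intervals; the function name and interface are our own:

```python
import numpy as np

def mask_ratio_bucket(mask):
    """Assign an irregular mask to one of the six 10% evaluation intervals.
    mask: (H, W) bool array, True where pixels are missing.
    """
    ratio = mask.mean()                   # fraction of missing pixels
    bucket = min(int(ratio * 10), 5)      # 0 -> 0-10%, ..., 5 -> 50-60%
    return f"{bucket * 10}-{(bucket + 1) * 10}%"
```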

Conclusion

In this paper, we propose a new and effective inpainting algorithm. To inpaint damaged images with better visual credibility, the network must not only ensure that the foreground and background styles of the images remain consistent, but also extract more global contextual information to generate reasonably accurate structure and texture. Therefore, we designed the following modules: the ASF ResBlocks and the GAG and MAFE modules. The ASF module considers the visual style connection…

CRediT authorship contribution statement

Ye Zhu: Writing - review & editing, Supervision. Chao Wang: Writing - original draft, Investigation. Shuze Geng: Validation, Editing. Yang Yu: Data curation, Formal analysis. Xiaoke Hao: Conceptualization, Methodology, Software.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 62102129 and Grant 62276088, the Natural Science Foundation of Hebei Province under Grant F2021202030, Grant F2019202381, and Grant F2019202464, and the Science and Technology Project of the Hebei Education Department under Grant QN2020185.

References (42)

  • Bai, X., et al., Adaptive hash retrieval with kernel based similarity, Pattern Recognit. (2018)
  • Helbert, D., et al., Patch graph-based wavelet inpainting for color images, J. Vis. Commun. Image Represent. (2019)
  • Shao, H., et al., Generative image inpainting with salient prior and relative total variation, J. Vis. Commun. Image Represent. (2021)
  • Criminisi, A., et al., Region filling and object removal by exemplar-based image inpainting, IEEE Trans. Image Process. (2004)
  • Y. Jo, J. Park, Sc-fegan: Face editing generative adversarial network with user's sketch and color, in: Proceedings of...
  • Criminisi, A., et al., Object removal by exemplar-based inpainting
  • Q. Sun, L. Ma, S.J. Oh, L. Van Gool, B. Schiele, M. Fritz, Natural and effective obfuscation by head inpainting, in:...
  • Li, K., et al., Image inpainting algorithm based on TV model and evolutionary algorithm, Soft Comput. (2016)
  • Li, H., et al., Localization of diffusion-based inpainting in digital images, IEEE Trans. Inf. Forensics Secur. (2017)
  • Sridevi, G., et al., Image inpainting based on fractional-order nonlinear diffusion for image reconstruction, Circuits Systems Signal Process. (2019)
  • Darabi, S., et al., Image melding: Combining inconsistent images using patch-based synthesis, ACM Trans. Graph. (2012)
  • Barnes, C., et al., PatchMatch: A randomized correspondence algorithm for structural image editing, ACM Trans. Graph. (2009)
  • G. Liu, F.A. Reda, K.J. Shih, T.-C. Wang, A. Tao, B. Catanzaro, Image inpainting for irregular holes using partial...
  • J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T.S. Huang, Free-form image inpainting with gated convolution, in: Proceedings...
  • T. Yu, Z. Guo, X. Jin, S. Wu, Z. Chen, W. Li, Z. Zhang, S. Liu, Region normalization for image inpainting, in:...
  • Ulyanov, D., et al., Instance normalization: The missing ingredient for fast stylization (2016)
  • K. Nazeri, E. Ng, T. Joseph, F. Qureshi, M. Ebrahimi, Edgeconnect: Structure guided image inpainting using edge...
  • J. Li, F. He, L. Zhang, B. Du, D. Tao, Progressive reconstruction of visual structure for image inpainting, in:...
  • Xiong, W., et al., Foreground-aware image inpainting
  • J. Zhang, L. Niu, D. Yang, L. Kang, Y. Li, W. Zhao, L. Zhang, GAIN: Gradient augmented inpainting network for irregular...
  • Doersch, C., et al., What makes Paris look like Paris?, ACM Trans. Graph. (2012)