Pattern Recognition

Volume 134, February 2023, 109046

Deep learning for image inpainting: A survey

https://doi.org/10.1016/j.patcog.2022.109046

Highlights

  • We summarize existing deep learning-based image inpainting algorithms in three aspects, including inpainting strategies, network structures and loss functions.

  • We introduce the open source codes, commonly used datasets, evaluation metrics, and application scenarios.

  • We compare the inpainting algorithms with released source codes.

  • We outline the challenges and possible future directions.

Abstract

Image inpainting has been widely exploited in the field of computer vision and image processing. The main purpose of image inpainting is to produce visually plausible structure and texture for the missing regions of damaged images. In the past decade, the success of deep learning has brought new opportunities to many vision tasks, which has promoted the development of a large number of deep learning-based image inpainting methods. Although these methods have many similarities, they also have their own characteristics due to differences in data types, application scenarios, computing platforms, etc. It is necessary to classify and summarize these methods to provide a reference for the research community. In this survey, we present a comprehensive overview of recent advances in deep learning-based image inpainting. First, we categorize the deep learning-based techniques from multiple perspectives: inpainting strategies, network structures, and loss functions. Second, we summarize the open source codes and representative public datasets, and introduce the evaluation metrics for quantitative comparisons. Third, we summarize the real-world applications of image inpainting in different scenarios, and give a detailed analysis of the performance of different inpainting algorithms. Finally, we conclude the survey and discuss future directions.

Introduction

In the past few decades, digital images have become one of the most important carriers for recording and disseminating information. Various image processing technologies have been proposed to meet the requirements of image-based applications, such as image denoising, image super-resolution, image colorization, image inpainting, and so on. Among them, image inpainting aims to provide a visually plausible restoration for the missing regions of damaged images. Images get corrupted for various reasons: a scanned old photo with cracks, a captured image with unwanted objects, a mural with damaged paintings, etc. If the missing regions are small, such as regular 2-by-2 sampling patterns, image interpolation can solve the problem; otherwise, image inpainting is required.

Image inpainting is a very challenging problem. First, the inputs are very diverse. In addition to the traditional gray and color images of natural scenes [1], [2], [3], line drawings/sketches [4], [5], textures [6], texts [7], and depth images [8], [9], [10] are also common and important inputs. Different types of inputs may lead to different inpainting strategies or algorithms. Second, the damage to the images may be very large, which commonly leads to unsatisfactory results for traditional patch-based algorithms [11], [12], [13], partial differential equation-based algorithms [2], [3], [14], and interpolation algorithms [15], [16]. Third, inpainting is an ill-posed problem: the inpainting results are not unique, yet most algorithms consider only one possible result [17], [18].

With the rapid development of deep learning in computer vision, deep learning-based algorithms have demonstrated high effectiveness in inpainting tasks. Compared to traditional algorithms, deep learning-based algorithms can better capture high-level semantics and obtain significantly improved results [19], [20]. Although these methods have many similarities, they also have different characteristics due to differences in data types, application scenarios, computing platforms, etc. It is necessary to classify and summarize these methods to provide a reference for the research community. We track the development of deep learning-based inpainting algorithms and select the state-of-the-art and typical ones that present new solutions to the inpainting problem as the review focus. The new solutions can lie in the use of different inpainting strategies, or in modifications to network structures and loss functions. First, we consider inpainting strategies. For example, inpainting in a progressive fashion is a commonly used mode [21], [22]. Structural information, attention modules, and special convolutions are also worthy of attention when the missing regions become large and complex in damaged images [21], [22], [23]. Moreover, pluralistic inpainting is a new research direction for the ill-posed problem [17]. Second, we explore the development of network structures. As the research further develops, complex networks are progressively applied, from autoencoders [1], [23], [24] to variational autoencoders [17] and GAN-inversion networks [25]. Third, we focus on the use of loss functions in the training process. Among them, reconstruction loss and adversarial loss are the two basic loss functions [1], which later algorithms improve upon [21], [24]. Based on the original designs of the state-of-the-art and typical algorithms, we survey extended algorithms to show how these designs work better. In this paper, we choose 42 deep learning-based methods in total for evaluation, and categorize them from three perspectives: inpainting strategies, network structures, and loss functions. Fig. 1 illustrates the framework.

Inpainting strategies present different solutions to the problems of inpainting. According to the problems they address, we classify the deep learning-based algorithms into progressive, structural information-guided, attention-based, convolutions-aware, and pluralistic inpainting.

Progressive inpainting fills images in a step-wise manner, because inpainting tasks are essentially a puzzle-solving process that is hard to accomplish in a single step. Especially when the missing regions become larger, visible information is not sufficient for recovering all the pixels in one pass. Progressive inpainting covers five types: coarse-to-fine, part-to-full, low-to-high-resolution, structure-to-content, and mask-to-image inpainting. Yu et al. [21], Sagong et al. [26], and Guo et al. [27] employ the coarse-to-fine strategy, which makes a coarse prediction first, and then takes the coarse prediction as input to predict more refined results. Li et al. [28], Zhang et al. [29], and Zeng et al. [30] exploit the part-to-full strategy, inferring hole boundaries and then using the results as clues for further inference. Yang et al. [31] and Yi et al. [32] fill the missing regions of damaged images at each scale and upsample the images for the next scale, so we call this low-to-high-resolution inpainting. Nazeri et al. [22], Liao et al. [33], and Xiong et al. [34] adopt the structure-to-content strategy, which hallucinates the structure of missing regions, and fills them using the hallucinated structure as a prior. The last one is the mask-to-image strategy: Wang et al. [35] first estimate where to fill and then generate what to fill.
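
To make the coarse-to-fine idea concrete, below is a minimal PyTorch-style sketch of a two-stage pipeline in the spirit of Yu et al. [21]; coarse_net and refine_net are hypothetical stand-ins for any encoder-decoder generators, not the authors' actual networks.

    import torch
    import torch.nn as nn

    class CoarseToFine(nn.Module):
        """Two-stage coarse-to-fine inpainting sketch (cf. Yu et al. [21])."""
        def __init__(self, coarse_net: nn.Module, refine_net: nn.Module):
            super().__init__()
            self.coarse_net = coarse_net    # stage 1: rough prediction
            self.refine_net = refine_net    # stage 2: refinement of the coarse result

        def forward(self, img, mask):
            # img: (N, 3, H, W); mask: (N, 1, H, W), 1 = known pixel, 0 = hole.
            x = torch.cat([img * mask, mask], dim=1)
            coarse = self.coarse_net(x)
            # Keep the known pixels; only the holes take the coarse prediction.
            merged = img * mask + coarse * (1 - mask)
            fine = self.refine_net(torch.cat([merged, mask], dim=1))
            return coarse, fine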

Structural information-guided inpainting relies on the structure of the known regions, such as edges, segmentation, etc. When damaged images contain sharp details, the lack of fine structure in the missing regions is a giveaway that something is amiss [22]. Typical structural information-guided inpainting covers five types: edge-, segmentation-, landmark-, voxel-, and gradient-guided inpainting. It is worth noting that structural information-guided inpainting can occasionally be progressive as well; in the other cases, however, the structural information merely serves as guidance during training. We focus on the latter situation.
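
As a concrete illustration of edge guidance, the sketch below builds a network input from the edges of the known regions, assuming OpenCV's Canny detector as the edge extractor; EdgeConnect [22] instead predicts the edges inside the holes with a dedicated network, which this simplified sketch does not do.

    import cv2
    import numpy as np

    def edge_guided_input(img: np.ndarray, mask: np.ndarray) -> np.ndarray:
        """Stack masked image, mask, and known-region edges as input channels.

        img: uint8 RGB image of shape (H, W, 3); mask: uint8 (H, W), 1 = known pixel.
        """
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        edges = cv2.Canny(gray, 100, 200)
        edges[mask == 0] = 0                   # keep edges of the known regions only
        masked = img * mask[..., None]
        return np.concatenate(
            [masked, mask[..., None] * 255, edges[..., None]], axis=-1
        )                                      # (H, W, 5) network input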

Attention-based inpainting considers information from distant spatial locations, to address the limitation that CNNs are not effective at borrowing distant features in inpainting tasks. According to different attention mechanisms, attention-based inpainting presents five different types: contextual attention-based, attention transfer-based, cross attention-based, patch swap-based, and transformer-based inpainting. Yu et al. [21], Sagong et al. [26], and Wang et al. learn to borrow feature information from contextual patches to generate missing regions, so we call this contextual attention-based inpainting. Zeng et al. [36], Yi et al. [32], and Zeng et al. [30] employ the attention transfer-based strategy, which obtains attention scores from another feature map to guide the generation of missing patches. Zhao et al. [18] propose the cross attention-based strategy, computing the attention scores of instance patches with contextual patches to guide the generation of missing regions. Song et al. [37], Liu et al. [38], Wang et al. [39], and Wang et al. [40] exploit the patch swap-based strategy, where each patch in the missing regions searches for the most similar known patch and swaps with it. Wan et al. [41] and Yu et al. [42] adopt transformers [43], a more complex attention mechanism, to model the underlying distribution of reconstructed images.
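
The following is a simplified PyTorch sketch of contextual attention in the spirit of Yu et al. [21], assuming batch size one and cosine similarity between 3-by-3 background patches and every foreground location; the original module adds masking and attention-propagation details omitted here.

    import torch
    import torch.nn.functional as F

    def contextual_attention(fg, bg, k=3, scale=10.0):
        # fg: (1, C, H, W) features of the missing regions (foreground).
        # bg: (1, C, H, W) features of the known regions (background).
        C = bg.shape[1]
        # Extract k-by-k background patches to act as matching filters.
        patches = F.unfold(bg, kernel_size=k, padding=k // 2)      # (1, C*k*k, L)
        L = patches.shape[-1]
        w = patches.transpose(1, 2).reshape(L, C, k, k)
        w_norm = w / (w.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)
        # Cosine similarity of every foreground location with every bg patch.
        scores = F.conv2d(fg, w_norm, padding=k // 2)              # (1, L, H, W)
        attn = F.softmax(scores * scale, dim=1)                    # sharpened attention
        # Reconstruct the foreground as a weighted paste of bg patches.
        out = F.conv_transpose2d(attn, w, padding=k // 2)
        return out / (k * k)   # roughly average the overlapping contributions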

Convolutions-aware inpainting employs masks to indicate the missing regions [44] and control how information is propagated across multiple regions [45], [46]. For damaged images, not all the information is useful. Traditional convolutions are conditioned on both valid pixels and substitute values in the missing regions, leading to artifacts such as color discrepancy and blurriness [23]. Liu et al. [23] propose the partial convolutions-based strategy, automatically updating masks to distinguish the missing regions. However, partial convolutions-based inpainting updates the mask with hard rules, which limits flexibility. Yu et al. [47] provide a learnable mechanism for the mask updating, named gated convolutions-based inpainting. Besides, partial convolutions-based inpainting only updates masks in the encoding network, ignoring the decoding network. Xie et al. [48] employ the bidirectional convolutions-based strategy, not only using a learnable attention map module for the mask updating, but also implementing it in the decoding network. The last is the region-wise convolutions-based strategy: Ma et al. [49] do not design new convolutions, but treat the known and missing regions with different convolution filters. For convolutions-aware inpainting, a critical issue is how to generate irregular holes; accordingly, hole-generation algorithms have been proposed.
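
A minimal sketch of a partial convolution with the hard mask-update rule of Liu et al. [23] is given below, under the simplifying assumption of bias-free convolutions (the original layer adds the bias after re-scaling).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PartialConv2d(nn.Conv2d):
        """Partial convolution sketch with the hard mask-update rule [23]."""
        def forward(self, x, mask):
            # x: (N, C, H, W); mask: (N, 1, H, W), 1 = valid pixel, 0 = hole.
            with torch.no_grad():
                ones = torch.ones(1, 1, *self.kernel_size, device=x.device)
                valid = F.conv2d(mask, ones, stride=self.stride,
                                 padding=self.padding)   # valid pixels per window
            out = super().forward(x * mask)              # convolve valid pixels only
            scale = (self.kernel_size[0] * self.kernel_size[1]) / valid.clamp(min=1.0)
            out = out * scale * (valid > 0)              # re-normalize; zero dead windows
            new_mask = (valid > 0).float()               # any valid pixel under the window
            return out, new_mask                         # makes it valid for the next layer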

Pluralistic inpainting generates multiple results for a single damaged image, because inpainting is an ill-posed problem, where a number of visually plausible results can satisfy the constraints of image restoration [18]. Typical pluralistic inpainting covers three types: GAN-, VAE-, and transformer-based pluralistic inpainting. Cai and Wei [50] and Liu et al. [51] employ GANs to generate realistic inpainting results, and input random noise to improve the variety of the results. Zheng et al. [17], Zhao et al. [18], and Peng et al. [52] exploit variational autoencoders and GANs to generate more than one possible result. Wan et al. [41] and Yu et al. [42] adopt transformers [43] to model the underlying distribution of reconstructed images, and each sampled vector corresponds to one result.

Concerning network structures, deep learning-based algorithms can be broadly classified into autoencoder-based, variational autoencoder-based, and GAN-inversion structures.

Autoencoder-based structure trains a convolutional neural network to regress the missing pixel values. A typical autoencoder-based structure performs two steps: (1) an encoder captures the context of an image into a compact latent feature representation; and (2) a decoder uses the representation to produce the missing image content [1]. The progress of autoencoder-based structures follows developments in the broader image processing field. Pathak et al. [1] derive their network structure from AlexNet [53], a classical CNN structure suitable for image classification tasks. Iizuka et al. [24] draw on image segmentation tasks, transforming the fully-connected layers in the CNN structure into convolutional layers to accept input images of any size. Liu et al. [23] design a U-Net-based structure derived from U-Net [54], which is widely used in image segmentation [54] and image translation tasks [55].
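
The following bare-bones sketch illustrates this encoder-decoder pattern; it is far shallower than the context encoder of Pathak et al. [1], which also uses a channel-wise fully-connected bottleneck.

    import torch
    import torch.nn as nn

    class InpaintAutoencoder(nn.Module):
        """Minimal encoder-decoder sketch of the structure in [1]."""
        def __init__(self, ch: int = 64):
            super().__init__()
            self.encoder = nn.Sequential(   # compress context into features
                nn.Conv2d(4, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            )
            self.decoder = nn.Sequential(   # regress the missing pixel values
                nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1), nn.Tanh(),
            )

        def forward(self, masked_img, mask):
            x = torch.cat([masked_img, mask], dim=1)   # 3 RGB channels + 1 mask channel
            return self.decoder(self.encoder(x))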

Although existing deep learning-based algorithms are able to produce visually realistic and semantically correct results, they produce only one result for each corrupted input [18]. Zheng et al. [17] and Zhao et al. [18] adopt a variational autoencoder-based structure, which constrains the encoding stage to force latent vectors to roughly follow a standard normal distribution. The sampled latent vector contains information of the missing regions, and each latent vector corresponds to one result.
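
A minimal sketch of how such a model yields multiple results is shown below; encoder and decoder are hypothetical modules, with the encoder assumed to output the mean and log-variance of the latent distribution.

    import torch

    def sample_pluralistic(encoder, decoder, masked_img, n_samples=4):
        """Draw several completions from a VAE-style inpainting model."""
        mu, logvar = encoder(masked_img)
        std = (0.5 * logvar).exp()
        results = []
        for _ in range(n_samples):
            z = mu + std * torch.randn_like(std)   # reparameterization trick
            results.append(decoder(z))             # each sample -> one plausible result
        return results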

GAN-inversion structure is independent of the autoencoder-based and variational autoencoder-based structures. Generally, GANs serve as an adversarial loss in the training process. However, Yeh et al. [56], Vitoria et al. [57], Lahiri et al. [58], and Pan et al. [25] propose the GAN-inversion structure, which aims to find a vector in the latent space that best reconstructs the given image, where the GAN generator is learned in advance and kept fixed.
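
The following sketch captures the core of GAN-inversion inpainting: optimizing a latent code so that a frozen, pretrained generator G reproduces the known pixels; z_dim and the optimizer settings are illustrative assumptions.

    import torch

    def gan_inversion_inpaint(G, damaged, mask, steps=500, lr=0.05, z_dim=512):
        """Optimize a latent code so the frozen generator G matches known pixels."""
        z = torch.randn(1, z_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            recon = G(z)
            loss = ((recon - damaged) * mask).pow(2).mean()  # match known regions only
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            out = G(z)
        return damaged * mask + out * (1 - mask)   # paste generated content into holes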

Loss functions play an important role in the process of network training. Here we briefly discuss two basic loss functions. Other special loss functions will be explored in Section 4.

Reconstruction loss is responsible for capturing the overall structure of the missing regions and contextual coherence [1]. It computes the pixel-wise distance between the original input and the final output. Two variants, weighted reconstruction loss and multi-scale reconstruction loss, are also widely used. However, reconstruction loss alone yields blurry solutions, failing to restore high-frequency details. To ameliorate the problem, adversarial loss is added.
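
As an illustration, the sketch below implements a weighted L1 reconstruction loss; the 6:1 hole-to-valid weighting follows Liu et al. [23], but other papers use other ratios.

    import torch

    def weighted_recon_loss(pred, target, mask, hole_weight=6.0):
        """Weighted L1 reconstruction loss; mask: 1 = known pixel, 0 = hole."""
        l1 = (pred - target).abs()
        valid_loss = (l1 * mask).sum() / mask.sum().clamp(min=1.0)
        hole_loss = (l1 * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1.0)
        return valid_loss + hole_weight * hole_loss   # penalize holes more heavily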

Adversarial loss tries to make predictions look real, based on GANs [59]. Briefly speaking, the learning procedure is a two-player game where the discriminator D takes both the predictions of the generator G and the ground truth samples as input, and tries to distinguish them, while G tries to fool D by producing samples that appear as real as possible [1]. With the development of GANs, modified versions of the adversarial loss, such as WGAN-based, LSGAN-based, global and local, and PatchGAN-based adversarial losses, have been proposed, making the training process faster and more stable.
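
For concreteness, the sketch below shows the hinge variant of the adversarial loss, one common choice in inpainting; the other variants differ only in these two functions.

    import torch.nn.functional as F

    def discriminator_loss(D, real, fake):
        """Hinge adversarial loss for the discriminator."""
        return F.relu(1.0 - D(real)).mean() + F.relu(1.0 + D(fake.detach())).mean()

    def generator_loss(D, fake):
        # G tries to make D score its completions as real.
        return -D(fake).mean()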

There are several survey papers related to image inpainting, e.g., on image synthesis [60], face image inpainting [61], [62], traditional inpainting [63], [64], deep learning-based inpainting [65], [66], [67], [68], [69], [70], and the combination of traditional and deep learning-based inpainting [71], [72], [73]. For example, Coloma et al. [70] focus on pluralistic inpainting, which generates multiple results for a single damaged image, and analyse the underlying theory and recent proposals. However, other surveys do not provide a comprehensive or structured overview of deep learning-based inpainting algorithms. In this survey, representative and advanced inpainting algorithms are discussed from multiple perspectives, covering all components mentioned in Fig. 1.

The remainder of this survey is organized as follows. Sections 2–4 give the summarization from the perspectives of inpainting strategies, network structures, and loss functions, respectively; Section 5 introduces the codes and datasets in this field; Section 6 gives the evaluation metrics; Section 7 introduces the application scenarios; Section 8 compares the performance of different methods; Section 9 discusses the challenges and future directions; and Section 10 concludes the survey.


Inpainting strategies

Early deep learning-based algorithms [1] perform well only for small and regular holes, while inpainting with specific strategies demonstrates its superiority as cases get more complicated. In this section, we review current inpainting algorithms from the inpainting strategy perspective: (1) progressive inpainting; (2) structural information-guided inpainting; (3) attention-based inpainting; (4) convolutions-aware inpainting; and (5) pluralistic inpainting.

Network structures

Network structure is the core of inpainting algorithms. Essentially, it follows an encoder-decoder pipeline: (1) an encoder capturing the context of an image into a compact feature representation; and (2) a decoder using the representation to generate the missing content [1]. In this section, we classify inpainting algorithms from the network structure perspective as follows: (1) autoencoder-based structure; (2) variational autoencoder-based structure; and (3) GAN-inversion structure.

Loss functions

Loss functions, the training objectives of networks, penalize the deviation between network predictions and true data labels. In this section, we review commonly used loss functions in inpainting tasks. There are two basic loss functions: (1) reconstruction loss; and (2) adversarial loss.

Codes and datasets

Open source codes are made freely available for possible modification and redistribution. Based on the codes, the original algorithms can be applied to related tasks. In Table 1, we summarize the current image inpainting algorithms whose source codes are open.

Public datasets are an integral part of machine learning. In inpainting tasks, the categories of public datasets include objects, scenes, faces, etc. In this section, we introduce some representative datasets in each category (see Table 2).

Evaluation metrics

Evaluation metrics indicate the effectiveness of the proposed algorithms. On the one hand, mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS) [131] are used to measure the quality of reconstruction. On the other hand, inception score (IS) [132] and Fréchet inception distance (FID) [133], the standard metrics for assessing the quality of GANs, are used to measure the quality of the generated images.
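
A minimal sketch of the reconstruction-quality metrics is given below, using scikit-image for PSNR and SSIM (channel_axis requires scikit-image >= 0.19); LPIPS, IS, and FID additionally require pretrained networks and are omitted here.

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def reconstruction_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
        """MSE/PSNR/SSIM for one image pair; pred, gt: floats in [0, 1], (H, W, 3)."""
        return {
            "MSE": float(np.mean((gt - pred) ** 2)),
            "PSNR": peak_signal_noise_ratio(gt, pred, data_range=1.0),
            "SSIM": structural_similarity(gt, pred, channel_axis=-1, data_range=1.0),
        }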

Applications

Compared to traditional inpainting algorithms, deep learning-based algorithms have demonstrated their superiority as the missing regions get more complicated. Through continuous improvements, deep learning-based algorithms have come to play an important role in user-guided face editing, privacy protection, pose-guided image synthesis, digitization of cultural heritage, remote sensing, and other fields.

Performance

Most papers conduct both qualitative and quantitative evaluations. However, the algorithms chosen for comparison and the evaluation metrics used differ widely. It is not reasonable to collect results from different papers for comparison, as the testing sets in each paper are also different. To make a fair comparison, we run the same testing sets on the inpainting algorithms whose source codes are open and whose pretrained models are released, 19 in total.

In this section, we first introduce our

Challenges and future directions

In the past decade, great progress has been made in the inpainting-strategy selection, network-structure design, loss-function optimization, etc. However, there are still many challenges in this field.

Algorithms: Current inpainting algorithms achieve good results when processing simple scenes, small holes, and low-resolution images [65]. Beyond these settings, they can hardly obtain satisfactory results; for example, it is still difficult to predict the structure of a missing object. Here,

Conclusions

This survey presented a comprehensive review of recent advances in deep learning-based image inpainting. Researchers in the field can benefit from the survey in several ways: (1) developing new inpainting algorithms based on commonly used inpainting strategies, network structures, and loss functions; (2) acquiring representative public datasets to conduct experiments; and (3) understanding the challenges and future directions in the field, in pursuit of breakthrough research results.

In this survey, we concluded that:

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was supported by the National Key Research and Development Program of China under grant 2020YFC1522703, and the National Natural Science Foundation of China under grants 62171324, 61872277.


References (149)

  • H. Wang et al., Inpainting of Dunhuang murals by sparsely modeling the texture similarity and structure continuity, J. Comput. Cult. Heritage (JOCCH), 2019.
  • Q. Zou et al., Automatic inpainting by removing fence-like structures in RGBD images, Mach. Vis. Appl., 2014.
  • X. Han et al., Deep reinforcement learning of volume-guided progressive view inpainting for 3D point scene completion from a single depth image, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • A. Criminisi et al., Region filling and object removal by exemplar-based image inpainting, IEEE Trans. Image Process., 2004.
  • C. Barnes et al., PatchMatch: a randomized correspondence algorithm for structural image editing, ACM Trans. Graph., 2009.
  • M. Bertalmio et al., Image inpainting, Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, 2000.
  • J. Ji et al., Image interpolation using multi-scale attention-aware inception network, IEEE Trans. Image Process., 2020.
  • L. Yu, K. Liu, M.T. Orchard, Manifold-inspired single image interpolation, arXiv preprint...
  • C. Zheng et al., Pluralistic image completion, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • L. Zhao et al., UCTGAN: diverse image inpainting based on unsupervised cross-space translation, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • J. Xie et al., Image denoising and inpainting with deep neural networks, Advances in Neural Information Processing Systems (NIPS), 2012.
  • A. Fawzi et al., Image inpainting through neural networks hallucinations, Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2016.
  • J. Yu et al., Generative image inpainting with contextual attention, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • K. Nazeri et al., EdgeConnect: structure guided image inpainting using edge prediction, IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019.
  • G. Liu et al., Image inpainting for irregular holes using partial convolutions, Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • S. Iizuka et al., Globally and locally consistent image completion, ACM Trans. Graph., 2017.
  • X. Pan et al., Exploiting deep generative prior for versatile image restoration and manipulation, Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • M.-c. Sagong et al., PEPSI: fast image inpainting with parallel decoding network, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Q. Guo et al., JPGNet: joint predictive filtering and generative network for image inpainting, Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), 2021.
  • J. Li et al., Recurrent feature reasoning for image inpainting, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • H. Zhang et al., Semantic image inpainting with progressive generative networks, Proceedings of the 26th ACM International Conference on Multimedia (ACM MM), 2018.
  • Y. Zeng et al., High-resolution image inpainting with iterative confidence feedback and guided upsampling, Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • C. Yang et al., High-resolution image inpainting using multi-scale neural patch synthesis, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Z. Yi et al., Contextual residual aggregation for ultra high-resolution image inpainting, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • L. Liao et al., Edge-aware context encoder for image inpainting, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
  • W. Xiong et al., Foreground-aware image inpainting, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Y. Wang et al., VCNet: a robust approach to blind image inpainting, Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Y. Zeng et al., Learning pyramid-context encoder network for high-quality image inpainting, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Y. Song et al., Contextual-based image inpainting: infer, match, and translate, Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • H. Liu et al., Coherent semantic attention for image inpainting, IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • N. Wang et al., Musical: multi-scale image contextual attention learning for inpainting, IJCAI, 2019.
  • Z. Wan et al., High-fidelity pluralistic image completion with transformers, IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • Y. Yu et al., Diverse image inpainting with bidirectional and autoregressive transformers, Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), 2021.
  • A. Vaswani et al., Attention is all you need, Advances in Neural Information Processing Systems (NIPS), 2017.
  • R. Köhler et al., Mask-specific inpainting with deep neural networks, German Conference on Pattern Recognition, 2014.
  • J.S. Ren et al., Shepard convolutional neural networks, Advances in Neural Information Processing Systems (NIPS), 2015.
  • A. Dapogny et al., The missing data encoder: cross-channel image completion with hide-and-seek adversarial network, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • J. Yu et al., Free-form image inpainting with gated convolution, IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • C. Xie et al., Image inpainting with learnable bidirectional attention maps, IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • Y. Ma et al., Coarse-to-fine image inpainting via region-wise convolutions and non-local correlation, IJCAI, 2019.

    Hanyu Xiang was born in 1998. He is currently pursuing the PhD degree in Photogrammetry and Remote Sensing from Wuhan University, Wuhan, China. His research interests include computer vision and digitalization of cultural heritage.

    Qin Zou received the B.E. degree in information engineering and the Ph.D. degree in computer vision from Wuhan University, China, in 2004 and 2012, respectively. From 2010 to 2011, he was a visiting PhD student at the Computer Vision Lab, University of South Carolina, USA. Currently, he is an associate professor with the School of Computer Science, Wuhan University. He is a co-recipient of the National Technology Invention Award of China 2015. His research activities involve computer vision, pattern recognition, and machine learning. He is serving as an Associate Editor of IEEE Transactions on Intelligent Vehicles. He is a senior member of the IEEE.

    Muhammad Ali Nawaz was born in 1991. He is currently pursuing the PhD degree in Computer Science and Technology from Wuhan University, Wuhan, China. His research interests include deep learning and computer vision.

Xianfeng Huang received the PhD degree from Wuhan University, Wuhan, China. He is currently a Full Professor with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing. His research interests include photogrammetry, computer vision, and digitalization of cultural heritage.

Fan Zhang received the PhD degree from Wuhan University, Wuhan, China. He is currently an Associate Professor with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing. His research interests include photogrammetry and digitalization of cultural heritage.

Hongkai Yu received the PhD in Computer Science and Engineering from the University of South Carolina. He is currently an assistant professor in the Department of Electrical Engineering and Computer Science at Cleveland State University. His research interests include computer vision, machine learning, deep learning, artificial intelligence, and data science.
