ABSTRACT
Given a partially masked image, image inpainting aims to complete the missing region and output a plausible image. Most existing image inpainting methods complete the missing region by expanding or borrowing information from the surrounding source region, which works well when the original content of the missing region resembles its surroundings. When insufficient contextual information can be referenced from the source region, however, the results are unsatisfactory. Moreover, inpainting results should be diverse, and this diversity should be controllable. Based on these observations, we propose a new inpainting problem that introduces text as a form of guidance to direct and control the inpainting process. The main difference from previous work is that the result must be consistent not only with the source region but also with the textual guidance. In this way, we aim to avoid unreasonable completions while making the process controllable. We propose a progressive coarse-to-fine cross-modal generative network and adopt a text-image-text training schema to generate visually consistent and semantically coherent images. Extensive quantitative and qualitative experiments on two public datasets with captions demonstrate the effectiveness of our method.
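The consistency requirement with the source region is typically enforced by a compositing step common to inpainting pipelines: known pixels are copied from the input, and only the masked region is taken from the generator output. A minimal sketch of that step (the values and the `composite_inpainting` helper are illustrative, not the paper's implementation):

```python
def composite_inpainting(source, generated, mask):
    """Keep known pixels from `source`; fill the missing region from `generated`.

    All arguments are 2-D nested lists of floats; mask == 1.0 marks the
    missing region to be completed, mask == 0.0 the known source region.
    """
    return [
        [m * g + (1.0 - m) * s for s, g, m in zip(src_row, gen_row, mask_row)]
        for src_row, gen_row, mask_row in zip(source, generated, mask)
    ]

# Toy single-channel 2x2 example (hypothetical values).
source = [[0.2, 0.4], [0.6, 0.8]]       # partially masked input image
generated = [[0.9, 0.9], [0.9, 0.9]]    # e.g. output of a text-conditioned generator
mask = [[1.0, 0.0], [0.0, 1.0]]         # 1.0 = missing, 0.0 = known

out = composite_inpainting(source, generated, mask)
# Masked pixels come from the generator; the rest are copied from the source,
# so the known region is preserved exactly.
```

In a text-guided setting, `generated` would additionally be conditioned on a caption embedding, which is what allows the completed content to follow the textual guidance rather than only the surrounding pixels.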
Index Terms
- Text-Guided Image Inpainting