Abstract
Text-guided image manipulation aims to modify the visual attributes of images according to textual descriptions. Existing works either produce mismatches between the generated images and the textual descriptions, or pollute text-irrelevant image regions. In this paper, we propose a dilated residual aggregation network (denoted as DRA) for text-guided image manipulation, which exploits a long-distance residual with dilated convolutions (RD) to aggregate the encoded visual content and style features with the textual features of the guiding descriptions. In particular, the dilated convolutions enlarge the receptive field without sacrificing the spatial resolution of intermediate features, which benefits the reconstruction of texture details that match the textual descriptions. Furthermore, we propose an attention-guided injection module (AIM) that injects textual semantics into the feature maps of DRA without polluting text-irrelevant image regions, by combining a triplet attention mechanism with central biasing instance normalization. Quantitative and qualitative experiments on the CUB-200-2011 and Oxford-102 datasets demonstrate the superior performance of the proposed DRA.
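The key property claimed for the dilated convolutions — a larger receptive field at unchanged spatial resolution, which in turn allows a long-distance residual connection — can be illustrated with a minimal single-channel NumPy sketch. This is not the paper's implementation; the function name, the 3×3 kernel, and the padding scheme are illustrative assumptions:

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Single-channel 2D dilated convolution.

    Padding is set to dilation * (k - 1) // 2, so the output has the
    same spatial resolution as the input for every dilation rate,
    while the effective receptive field grows to (k - 1) * dilation + 1.
    """
    k = kernel.shape[0]                      # assume a square, odd-sized kernel
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)                      # zero-pad so the output keeps x's shape
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            for ki in range(k):
                for kj in range(k):
                    # taps are spaced `dilation` pixels apart
                    out[i, j] += kernel[ki, kj] * xp[i + ki * dilation,
                                                     j + kj * dilation]
    return out

x = np.random.rand(16, 16)
kernel = np.random.rand(3, 3)
for d in (1, 2, 4):
    y = dilated_conv2d(x, kernel, dilation=d)
    assert y.shape == x.shape                # resolution preserved at every rate
    residual = x + y                         # shapes match, so a long-distance
                                             # residual addition is well-defined
```

Because the spatial resolution never shrinks, features produced at different dilation rates can all be summed with the input, which is what makes residual aggregation across distant layers possible without any upsampling.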
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Nos. 62076073 and 61902077), the Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515010616), the Science and Technology Program of Guangzhou (Nos. 202102020524 and 202007040005), the Guangdong Innovative Research Team Program (No. 2014ZT05G157), the Key-Area Research and Development Program of Guangdong Province (No. 2019B010136001), the Science and Technology Planning Project of Guangdong Province (No. LZC0023), and Hong Kong RGC CRF Project C1031-18G.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Lu, S., Luo, D., Yang, Z., Hao, T., Li, Q., Liu, W. (2021). Dilated Residual Aggregation Network for Text-Guided Image Manipulation. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2021. Lecture Notes in Computer Science, vol. 12893. Springer, Cham. https://doi.org/10.1007/978-3-030-86365-4_3
DOI: https://doi.org/10.1007/978-3-030-86365-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86364-7
Online ISBN: 978-3-030-86365-4
eBook Packages: Computer Science, Computer Science (R0)