Neurocomputing, Volume 460, 14 October 2021, Pages 345-359

Person image generation with attention-based injection network

https://doi.org/10.1016/j.neucom.2021.06.077

Abstract

Person image generation is a challenging problem due to content ambiguity and style inconsistency. In this paper, we propose a novel Attention-based Injection Network (AIN) to address this issue. Instead of directly learning the relationship between the source and target images, we decompose the process into two accessible modules, namely the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN captures semantic information, embedding human attributes into the latent space via the semantic layout. PAN enables a natural re-coupling of pose and appearance, selectively integrating features to complete the human pose transformation. Additionally, a semantic layout loss is proposed to focus on the semantic content similarity between the source and generated images. Compared with other methods, our network enforces local texture and style consistency between the source and generated images. Experiments show that superior qualitative and quantitative results are obtained on the Market-1501 and DeepFashion datasets. On the basis of AIN, our network can further perform data augmentation for person re-identification (Re-ID), dramatically improving Re-ID accuracy.

Introduction

Person image generation aims to transfer a person image to an arbitrary pose while retaining the appearance details of the source image. This task, first introduced in [1], has become an emerging popular topic in the computer vision community. It has shown great potential in many applications, such as video generation [2], [3], [4], virtual clothes try-on [5], [6], [7], data augmentation for person-related vision tasks [8], [9], [10], [56], etc. The key challenges of this task include the following two aspects: (i) Limited by the non-rigid deformation structure of the human body, it is difficult to directly transform the spatially misaligned body, particularly when only local observations of the human body are given. (ii) Preserving clothing attributes, including texture and style, imposes an unexpected difficulty on the generation process.

Notable results of deep learning have provided significant tools for the pose transfer task [11], [12], [13], [57]. Early studies in [10], [14], [15], [16] directly adopt a global predictive strategy to transfer pose by utilizing the U-Net [17] structure to propagate low-level features. However, considering the non-rigid deformation structure of the human body, U-Net-based global methods often fail to address the spatial misalignment between the source and the target pose. The generated image consequently suffers from detail deficiency, which typically results in over-smoothed clothes. To improve performance, more and more research is devoted to better modeling body deformation and local feature transfer. Some methods in [8], [18], [19], [20] adopt a pose attention network to perform a local transfer that fits the target body topology. Others in [21], [22], [23], [24] fuse the feature representations of appearance and pose by controlling an attention-based decoder. Although the aforementioned approaches achieve better performance, the generated images still suffer from quality problems such as missing texture details and blurry boundaries. This is because these models rely on the extracted features for detail reconstruction, and each intermediate image representation struggles to reveal the regions to be transferred. Besides, the approaches in [8], [18], [19], [20], [21], [22], [23], [24] only take the source image and target pose as inputs, so clothing textures and human outline information may be ignored. It should be noted that preserving semantic information and replenishing fine-grained appearance details is crucial from the viewpoint of human visual perception.

To preserve the appearance and shape simultaneously, the semantic parsing map in [25], [26], [27], [28], [29], [30], [31] has received considerable attention as an intermediate representation between the source and synthesized images. Different from the key-point based pose representation, the semantic parsing map automatically provides a foreground mask and captures valuable prior information for person image generation. However, the generators in [25], [26], [27], [28], [29], [30], [31] directly concatenate the source image, semantic parsing map, and target pose as inputs to a basic U-Net, and then attempt to learn a mapping from the concatenated conditions to the target image in a forced manner. This inevitably makes it difficult to directly transform the spatially misaligned human body due to its inherent non-rigid nature.

Motivated by the aforementioned discussion, we propose a novel Attention-based Injection Network (AIN) to address the above challenging problems. Our generator is decomposed into two interconnected attention networks, called the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN is designed for semantic encoding, and PAN is constructed to enable a natural re-coupling of pose and appearance. Specifically, our model first utilizes SAN to automatically capture clothing attributes from the source image via semantic layouts. Then, the component clothing information, represented by the semantic code, is injected into each designed block of PAN. Inside each block of PAN, we create an attention mechanism to infer the image regions of interest based on the target pose, and integrate the appearance and pose information with the help of an Adaptive Instance Normalization (AdaIN) layer in affine transformation form. Each intermediate pose representation is thus able to better guide the image generation and substantially retain texture details. The desired image can be reconstructed following the target features. Our approach renders textures from the source image into the synthesized map by learning a feature-level mapping, and deals well with information missing and self-occlusion problems. Moreover, the proposed network helps alleviate ambiguities in inferring unobserved pixels. Experiments on the Market-1501 and DeepFashion datasets show that our network achieves superior performance on both qualitative and quantitative results.
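To make the injection mechanism concrete, the following PyTorch-style sketch shows how a semantic code (as produced by an encoder such as SAN) could be injected into one decoder block of PAN through an AdaIN layer, with a learned attention mask selecting the regions to be transferred. All class names, shapes, and the exact block layout here are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of one injection block (assumed layout, not the paper's code).
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: scale and shift the normalized
    features with affine parameters predicted from the semantic code."""
    def __init__(self, code_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.affine = nn.Linear(code_dim, num_features * 2)

    def forward(self, x, code):
        gamma, beta = self.affine(code).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

class InjectionBlock(nn.Module):
    """One block: infer regions of interest from the pose features, then
    selectively integrate the AdaIN-injected appearance information."""
    def __init__(self, channels, code_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.Conv2d(channels, 1, 1)        # per-pixel attention mask
        self.adain = AdaIN(code_dim, channels)

    def forward(self, pose_feat, sem_code):
        h = self.adain(self.conv(pose_feat), sem_code)
        mask = torch.sigmoid(self.attn(h))            # regions of interest
        return mask * h + (1 - mask) * pose_feat      # selective integration

# Example: inject a 128-d semantic code into 256-channel pose features.
block = InjectionBlock(256, 128)
out = block(torch.randn(1, 256, 64, 64), torch.randn(1, 128))
```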

Existing methods in [13], [16], [22], [30], [32] adopt a style loss to enforce the textures of the target and generated images to be similar around the corresponding pose joints. However, person images under different poses can differ drastically in appearance across views, which inevitably demands that the textures around the joints change with the pose. Therefore, a new semantic layout loss is proposed to focus on the semantic content similarity between the source and generated images (a minimal implementation sketch is given after the contribution list below). The representation of the semantic layout loss is captured by the correlation among the different channels of the image. We first separate component attributes from the person image via semantic layouts. The loss can then be implemented by computing the Gram matrix [33] for each patch. Our loss not only alleviates the influence of the background at the pixel level, but also enforces local texture and style consistency between the source and generated images. In summary, the main contributions of this paper are as follows:

  • A novel Attention-based Injection Network (AIN) is proposed to tackle the content ambiguity and style inconsistency challenges of the pose transfer task. Our method is able to preserve and replenish fine-grained semantic and appearance details. It helps alleviate ambiguities in inferring unobserved pixels and handles self-occlusion problems. Experiments demonstrate the rationality and effectiveness of the proposed method.

  • We create two attention networks, called the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN is designed to automatically capture semantic information from the source image via semantic layouts. By effectively selecting and deforming important regions of the image code, PAN enables a natural re-coupling of the image and pose.

  • A new semantic layout loss is designed to deal with the spatial misalignment between the source and target images. Our loss enforces component-level content and style consistency while well retaining texture details and attributes.

  • Our method exhibits superior performance on benchmarks in both preserving body shape and keeping clothing textures. Experiments show that it can enrich pose variations and improve person re-identification accuracy.
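As referenced above, the sketch below illustrates one plausible implementation of the semantic layout loss: feature maps are masked per semantic region, and the Gram matrices [33] of corresponding regions are compared between the source and generated images. The function names and the one-hot mask format are our assumptions, and the masks are assumed to be already resized to the feature resolution.

```python
# Hypothetical sketch of a semantic layout loss: texture/style statistics
# (Gram matrices) are matched per semantic region rather than around pose
# joints, so the background is excluded and each clothing component is
# compared with its counterpart.
import torch
import torch.nn.functional as F

def gram(feat):
    """Channel-wise correlation (Gram) matrix of a feature map."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def semantic_layout_loss(feat_src, feat_gen, masks_src, masks_gen):
    """feat_*: (B, C, H, W) features; masks_*: (B, K, H, W) one-hot
    semantic layouts with K regions, resized to the feature resolution."""
    loss = 0.0
    num_regions = masks_src.shape[1]
    for k in range(num_regions):
        m_s = masks_src[:, k:k + 1]   # (B, 1, H, W), broadcasts over channels
        m_g = masks_gen[:, k:k + 1]
        loss = loss + F.mse_loss(gram(feat_src * m_s), gram(feat_gen * m_g))
    return loss / num_regions
```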

The remainder of the paper is organized as follows. Section 2 reviews related work on person image generation. Section 3 introduces the proposed method in detail. Experimental results and extensive analysis on two datasets (Market-1501 [34] and DeepFashion [35]) are presented in Section 4. The application and conclusion are given in Section 5 and Section 6, respectively.

Section snippets

Pose-guided person image generation

An early attempt at pose transfer is presented by Ma et al. [1], which coarsely generates an image under the target pose in Stage-I and refines the appearance details in Stage-II. To further improve the appearance and shape of generated images, Ma et al. [11] disentangle the source image into appearance and pose, followed by encoding them into embedding features. A similar method is adopted in [15], which can disentangle appearance and pose of the source image by VAE [36] and

The proposed method

Given a target pose Pt and a source image Ips under the pose Ps, our goal is to generate an output image I^pt that follows the clothing appearance of Ips under the target pose Pt. In this paper, we design a novel end-to-end Attention-based Injection Network (AIN) to address this challenging task. The overall framework is shown in Fig. 1, where the inputs of our network are the source image Ips, the corresponding semantic map Sps, and the target pose Pt. We feed the inputs into AIN to capture
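For clarity, the interface implied by this formulation can be summarized in the hypothetical sketch below; the class and argument names are ours, and the internals of SAN and PAN are abstracted away.

```python
# Illustrative interface only: AIN consumes the source image I_ps, its
# semantic map S_ps, and the target pose P_t, and returns the synthesized
# image I^_pt. Names are assumptions for exposition.
import torch.nn as nn

class AIN(nn.Module):
    def __init__(self, san: nn.Module, pan: nn.Module):
        super().__init__()
        self.san = san   # Semantic-guided Attention Network (semantic encoding)
        self.pan = pan   # Pose-guided Attention Network (pose/appearance re-coupling)

    def forward(self, img_src, sem_src, pose_tgt):
        sem_code = self.san(img_src, sem_src)   # semantic/appearance code
        return self.pan(pose_tgt, sem_code)     # generated image I^_pt
```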

Experiments

In this section, we describe the datasets and metrics (Section 4.1), followed by the implementation details of the proposed method (Section 4.2). We then conduct extensive experiments to verify the design rationality and effectiveness of the proposed network (Section 4.3 and Section 4.4). The experimental results demonstrate the superiority of our method in both objective quantitative scores and subjective visual realness.

Application to person re-identification

An effective person pose transfer network can synthesize realistic-looking human images to augment the datasets of person-related vision tasks, which could improve network performance, especially in situations with limited pose variations and insufficient training data. Person re-identification (Re-ID), which aims to match a person across non-overlapping video cameras, has seen a boost in applications and has been an active research field in computer vision. Like the more recent method in

Conclusion

In this paper, we propose a novel Attention-based Injection Network for person image generation. To address the complexity of directly learning the mapping, the pose transfer process is decomposed into two accessible modules, SAN and PAN. SAN is able to extract semantic information automatically from the source image via the human layout. PAN enables a natural re-coupling of pose and appearance. Moreover, a new semantic layout loss is proposed to well retain the semantic content,

CRediT authorship contribution statement

Meichen Liu: Conceptualization, Methodology, Software, Validation, Formal analysis. Kejun Wang: Supervision, Writing - original draft, Writing - review & editing, Formal analysis, Funding acquisition. Ruihang Ji: Software, Validation, Writing - review & editing. Shuzhi Sam Ge: Writing - review & editing, Supervision, Investigation. Jing Chen: Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61573114) and by the College of Intelligent Systems Science and Engineering, Harbin Engineering University.

References (57)

  • L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, L. Van Gool, Pose guided person image generation, in: Advances in...
  • S. Tulyakov et al., MoCoGAN: Decomposing motion and content for video generation.
  • P. Zablotskaia, A. Siarohin, B. Zhao, L. Sigal, DwNet: Dense warp-based network for pose-guided human video...
  • Q. Chen et al., Scripted video generation with a bottom-up generative adversarial network, IEEE Trans. Image Process. (2020).
  • H. Dong et al., Towards multi-pose guided virtual try-on network.
  • H. Yang et al., Towards photo-realistic virtual try-on by adaptively generating-preserving image content.
  • S. Song et al., Unpaired person image generation with semantic parsing transformation, IEEE Trans. Pattern Anal. Mach. Intell. (2020).
  • Z. Zhu et al., Progressive pose attention transfer for person image generation.
  • M. Liu et al., Segmentation mask-guided person image generation, Appl. Intell. (2021).
  • A. Siarohin et al., Appearance and pose-conditioned human image generation using deformable GANs, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
  • L. Ma et al., Disentangled person image generation.
  • Z. Liu et al., ADCM: Attention dropout convolutional module, Neurocomputing (2020).
  • L. Yang et al., Towards fine-grained human pose transfer with detail replenishing network, IEEE Trans. Image Process. (2021).
  • X. Mao et al., Least squares generative adversarial networks.
  • P. Esser et al., A variational U-Net for conditional appearance and shape generation.
  • A. Pumarola et al., Unsupervised person image synthesis in arbitrary poses.
  • P. Isola et al., Image-to-image translation with conditional adversarial networks.
  • Y. Chen et al., Person image synthesis through siamese generative adversarial network, Neurocomputing (2020).
  • B. Chen et al., PMAN: Progressive multi-attention network for human pose transfer, IEEE Trans. Circuits Syst. Video Technol. (2021).
  • K. Li et al., PoNA: Pose-guided non-local attention for human pose transfer, IEEE Trans. Image Process. (2020).
  • S. Lathuilière et al., Attention-based fusion for multi-source human image generation.
  • Y. Ren et al., Deep image spatial transformation for person image generation.
  • S. Huang, H. Xiong, Z.-Q. Cheng, Q. Wang, X. Zhou, B. Wen, J. Huan, D. Dou, Generating person images with...
  • P. Ge et al., Focus and retain: Complement the broken pose in human image synthesis.
  • H. Dong et al., Soft-gated warping-GAN for pose-guided person image synthesis.
  • X. Han et al., VITON: An image-based virtual try-on network.
  • S. Hong et al., Learning hierarchical semantic image manipulation through structured representations.
  • S. Song et al., Unsupervised person image generation with semantic parsing transformation.


Meichen Liu is pursuing the Ph.D. degree at Harbin Engineering University, China. She is currently an exchange Ph.D. student in the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, supported by the College of Intelligent Systems Science and Engineering, Harbin Engineering University. Her current research interests include machine learning and deep neural networks, especially generative adversarial networks and person re-identification.

Kejun Wang received the B.E. degree in automation from the Northeast China Heavy Machinery Institute, China, in 1984, the M.E. degree in automation control theory and application from Harbin Engineering University, China, in 1987, and the Ph.D. degree in ship and marine special auxiliary devices and systems from Harbin Engineering University, China, in 1995. He is a professor in the College of Intelligent Systems Science and Engineering at Harbin Engineering University. He does research in the areas of biometrics and intelligent monitoring technology (vein, fingerprint, and gait recognition; moving-target detection and tracking), computational bioinformatics, and fitness Internet of Things technology. He has received research funding from agencies including the National Science Foundation, the National High Technology Research and Development Program of China, and the Heilongjiang Province Outstanding Youth Science Fund. He has published about 80 conference and journal papers and holds 6 invention patents.

Ruihang Ji received the B.S. degree in automation engineering from the Harbin Institute of Technology, Harbin, China, in 2016, where he is currently pursuing the Ph.D. degree in control science and engineering. He is currently an exchange Ph.D. student supported by the China Scholarship Council with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. His current research interests include finite-time control, unmanned aerial vehicles, and computer vision.

Shuzhi Sam Ge received the B.Sc. degree from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 1986 and the Ph.D. degree from Imperial College London, London, U.K., in 1993. He is the Director of the Social Robotics Laboratory of the Interactive Digital Media Institute, Singapore, and the Centre for Robotics, Chengdu, China, and a Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, on leave from the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu. He has co-authored four books and over 300 international journal and conference papers. His current research interests include social robotics, adaptive control, intelligent systems, and artificial intelligence. Dr. Ge is the Editor-in-Chief of the International Journal of Social Robotics (Springer). He has served/been serving as an Associate Editor for a number of flagship journals, including the IEEE Transactions on Automatic Control, the IEEE Transactions on Control Systems Technology, the IEEE Transactions on Neural Networks, and Automatica. He served as the Vice President for Technical Activities from 2009 to 2010 and for Membership Activities from 2011 to 2012, and as a member of the Board of Governors from 2007 to 2009 at the IEEE Control Systems Society. He is a fellow of the International Federation of Automatic Control, the Institution of Engineering and Technology, and the Society of Automotive Engineering.

Jing Chen received the B.E. degree in automation from Yanshan University, Qinhuangdao, China, in 2016. She is currently pursuing the Ph.D. degree in the College of Intelligent Systems Science and Engineering at Harbin Engineering University. Her research interests include computer vision, machine learning, and deep neural networks, especially multimodal emotion analysis.

This document is the result of the research project funded by the National Science Foundation.
