Person image generation with attention-based injection network☆
Introduction
Person image generation aims to transfer a person image to an arbitrary pose while retaining the appearance details of the source image. This task, first introduced in [1], has become a popular emerging topic in the computer vision community. It has shown great application potential in many tasks, such as video generation [2], [3], [4], virtual clothes try-on [5], [6], [7], and data augmentation for person-related vision tasks [8], [9], [10], [56]. The key challenges of this task lie in two aspects: (i) owing to the non-rigid deformation structure of the human body, it is difficult to directly transform the spatially misaligned body, particularly when only local observations of the human body are given; (ii) preserving clothing attributes, including texture and style, imposes additional difficulty on the generation process.
Notable advances in deep learning have provided significant tools for the pose transfer task [11], [12], [13], [57]. Early studies [10], [14], [15], [16] directly adopt a global predictive strategy, transferring pose by using the U-net [17] structure to propagate low-level features. However, given the non-rigid deformation structure of the human body, U-net-based global methods often fail to address the spatial misalignment between the source and the target pose, so the generated images suffer from detail deficiency, which typically results in over-smoothed clothes. To improve performance, more and more research has been devoted to better modeling body deformation and local feature transfer. Some methods [8], [18], [19], [20] adopt a pose attention network to perform a local transfer that fits the target body topology. Others [21], [22], [23], [24] fuse the feature representations of appearance and pose by controlling an attention-based decoder. Although the aforementioned approaches achieve better performance, the generated images still exhibit quality problems such as missing texture details and blurry boundaries. This is because these models rely on the extracted features for detail reconstruction, while each intermediate image representation can hardly reveal the regions to be transferred. Besides, the approaches in [8], [18], [19], [20], [21], [22], [23], [24] consider only the source image and target pose as inputs, so clothing textures and human outline information may be ignored. It should be noted that preserving semantic information and replenishing fine-grained appearance details are crucial from the viewpoint of human visual perception.
To preserve appearance and shape simultaneously, the semantic parsing map [25], [26], [27], [28], [29], [30], [31] has received considerable attention as an intermediate representation between the source and the synthesized image. Unlike key-point-based pose representations, the semantic parsing map automatically provides a foreground mask and captures valuable prior information for person image generation. However, the generators in [25], [26], [27], [28], [29], [30], [31] directly concatenate the source image, semantic parsing map, and target pose as inputs to a basic U-Net, and then attempt to learn a mapping from the concatenated conditions to the target image in a forced manner. This inevitably makes it difficult to directly transform the spatially misaligned human body, owing to its inherently non-rigid nature.
Motivated by the above discussion, we propose a novel Attention-based Injection Network (AIN) to address these challenges. Our generator is decomposed into two interconnected attention networks, called the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN is designed for semantic encoding, and PAN is constructed to enable a natural re-coupling of pose and appearance. Specifically, our model first uses SAN to automatically capture clothing attributes from the source image via semantic layouts. Then the component clothing information, represented by the semantic code, is injected into each designed block of PAN. Inside each PAN block, we create an attention mechanism to infer the image regions of interest based on the target pose, and integrate the appearance and pose information with the help of an Adaptive Instance Normalization (AdaIN) layer in affine transformation form. Each intermediate pose representation is thereby able to better guide the image generation and substantially retain texture details, so the desired image can be reconstructed from the target features. Our approach renders textures from the source image into the synthesized map by learning a feature-level mapping, and deals well with missing-information and self-occlusion problems. Moreover, the proposed network helps alleviate the ambiguities in inferring unobserved pixels. Experiments on the Market-1501 and DeepFashion datasets show that our network achieves superior qualitative and quantitative results.
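The AdaIN-style injection step described above can be sketched as follows. This is a minimal illustration under assumed shapes, not the authors' implementation: the semantic code is simply split into per-channel scale and shift parameters, whereas in practice these would be predicted by learned layers.

```python
import numpy as np

def adain_inject(features, semantic_code, eps=1e-5):
    """Illustrative AdaIN-style injection: normalize each channel of the
    content features, then apply scale/shift parameters derived from the
    semantic code (here, naively split in half; a hypothetical layout)."""
    c = features.shape[0]
    # Per-channel statistics of the content feature map (C x H x W)
    mean = features.reshape(c, -1).mean(axis=1).reshape(c, 1, 1)
    std = features.reshape(c, -1).std(axis=1).reshape(c, 1, 1)
    normalized = (features - mean) / (std + eps)
    # Affine transformation parameters from the injected semantic code
    scale = semantic_code[:c].reshape(c, 1, 1)
    shift = semantic_code[c:].reshape(c, 1, 1)
    return scale * normalized + shift

feat = np.random.randn(8, 16, 16)   # C x H x W pose feature map
code = np.random.randn(16)          # length 2*C semantic code
out = adain_inject(feat, code)
```

After injection, each channel's statistics are governed by the semantic code rather than the original content, which is what lets the appearance information steer the pose features.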
Existing methods [13], [16], [22], [30], [32] adopt a style loss to enforce similarity between the textures of the target and the generated image around the corresponding pose joints. However, person images under different poses can look drastically different in appearance across views, which inevitably requires the textures around joints to change with the pose. Therefore, a new semantic layout loss is proposed to focus on the semantic content similarity between the source and the generated image. The representation underlying the semantic layout loss is captured by the correlations among the different channels of the image. We first separate component attributes from the person image via semantic layouts; the loss can then be implemented by computing the Gram matrix [33] for each patch. Our loss not only alleviates the influence of the background at the pixel level, but also enforces local texture and style consistency between the source and generated images. In summary, the main contributions of this paper are as follows:
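The Gram-matrix computation underlying such a semantic layout loss can be sketched as follows. This is an illustrative sketch under assumed shapes (C x H x W feature maps and a binary semantic mask per component); the function names are hypothetical, not from the paper.

```python
import numpy as np

def masked_gram(features, mask, eps=1e-8):
    """Gram matrix over one semantically masked region.

    features: C x H x W feature map
    mask:     H x W binary map selecting one semantic component
    Returns a C x C matrix of channel-wise correlations, which captures
    texture/style while discarding spatial layout.
    """
    c = features.shape[0]
    # Keep only pixels belonging to the semantic component
    masked = (features * mask).reshape(c, -1)
    n = mask.sum() + eps
    return masked @ masked.T / n

def layout_style_loss(feat_src, feat_gen, mask):
    """Illustrative semantic layout loss for one component: mean squared
    difference of the source and generated Gram matrices in that region."""
    g_src = masked_gram(feat_src, mask)
    g_gen = masked_gram(feat_gen, mask)
    return float(((g_src - g_gen) ** 2).mean())

feat = np.random.randn(4, 8, 8)
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                               # one semantic region
loss_same = layout_style_loss(feat, feat, mask)    # identical features -> 0
```

Because the Gram matrix sums over spatial positions, the comparison is invariant to where a texture appears inside the region, which is why it tolerates pose-induced spatial misalignment better than a joint-anchored style loss.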
- A novel Attention-based Injection Network (AIN) is proposed to tackle the challenge of the pose transfer task with content and style inconsistency. Our method is able to preserve and replenish fine-grained semantic and appearance details, and it helps alleviate ambiguities in inferring unobserved pixels as well as self-occlusion problems. Experiments demonstrate the rationality and effectiveness of the proposed method.
- We create two attention networks, called the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN is designed to automatically capture semantic information from the source image via semantic layouts. By effectively selecting and deforming important regions of the image code, PAN enables a natural re-coupling of the image and pose.
- A new semantic layout loss is designed to deal with the spatial misalignment between the source and target images. Our loss enforces component-wise content and style consistency while retaining texture details and attributes.
- Our method exhibits superior performance on benchmarks in both preserving body shape and keeping clothing textures. Experiments show that it can enrich pose variations and improve person re-identification accuracy.
The remainder of the paper is structured as follows. In Section 2, we review related work on person image generation. In Section 3, we introduce the proposed method in detail. Subsequently, experimental results and extensive analysis on two datasets (Market-1501 [34] and DeepFashion [35]) are presented in Section 4. The application and conclusion are given in Section 5 and Section 6, respectively.
Section snippets
Pose-guided person image generation
The early attempt at pose transfer is presented by Ma et al. [1], which only coarsely generates an image under the target pose in Stage-I and refines the appearance details in Stage-II. To further improve the appearance and shape of generated images, Ma et al. [11] disentangle the source image into appearance and pose, followed by encoding them into embedding features. A similar method is adopted in [15], which disentangles the appearance and pose of the source image by VAE [36] and
The proposed method
Given a target pose and a source image under a source pose, our goal is to generate an output image that retains the clothing appearance of the source image while conforming to the target pose. In this paper, we design a novel end-to-end Attention-based Injection Network (AIN) to address this challenging task. The overall framework is shown in Fig. 1, where the inputs of our network are the source image, the corresponding semantic map, and the target pose. We feed these inputs into AIN to capture
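The data flow just described can be sketched at a high level. This is a structural sketch under assumed shapes only: the SAN and PAN bodies below are simple placeholders (region pooling and softmax attention over component codes), not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def san_encode(source_feat, semantic_map):
    """Placeholder for SAN: pool the source features inside each semantic
    region, yielding one code vector per clothing component."""
    c = source_feat.shape[0]
    codes = []
    for label in np.unique(semantic_map):
        mask = (semantic_map == label).astype(float)
        region = (source_feat * mask).reshape(c, -1).sum(axis=1) / (mask.sum() + 1e-8)
        codes.append(region)
    return np.stack(codes)                       # (num_components, C)

def pan_block(pose_feat, codes):
    """Placeholder for one PAN block: attend over the component codes
    based on the pose features, then inject the attended code."""
    c, h, w = pose_feat.shape
    query = pose_feat.reshape(c, -1).mean(axis=1)    # (C,) pose summary
    scores = codes @ query                           # relevance per component
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                               # softmax weights
    injected = (attn @ codes).reshape(c, 1, 1)
    return pose_feat + injected                      # inject into pose stream

source = rng.standard_normal((8, 16, 16))            # source feature map
parsing = rng.integers(0, 3, size=(16, 16))          # semantic parsing map
pose = rng.standard_normal((8, 16, 16))              # target-pose features
out = pan_block(pose, san_encode(source, parsing))
```

In the actual model this block would be stacked several times, with AdaIN-style affine injection in place of the additive injection used here for brevity.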
Experiments
In this section, we describe the datasets and metrics (Section 4.1), followed by the implementation details of the proposed method (Section 4.2). We then conduct extensive experiments to verify the design rationality and effectiveness of the proposed network (Section 4.3 and Section 4.4). The experimental results demonstrate the superiority of our method in both objective quantitative scores and subjective visual realism.
Application to person re-identification
An effective person pose transfer network can synthesize realistic-looking human images to augment the datasets of person-related vision tasks, which can improve network performance, especially in situations with limited pose variations and insufficient training data. Person re-identification (Re-ID), which aims to match a person across non-overlapping video cameras, has seen a boost in applications and has been an active research field in computer vision. Like the more recent method in
Conclusion
In this paper, we propose a novel Attention-based Injection Network for person image generation. To address the complexity of directly learning the mapping, the pose transfer process is decomposed into two accessible modules, SAN and PAN. SAN extracts semantic information automatically from the source image via the human layout, while PAN enables a natural re-coupling of pose and appearance. Moreover, a new semantic layout loss is proposed to well retain the semantic content,
CRediT authorship contribution statement
Meichen Liu: Conceptualization, Methodology, Software, Validation, Formal analysis. Kejun Wang: Supervision, Writing - original draft, Writing - review & editing, Formal analysis, Funding acquisition. Ruihang Ji: Software, Validation, Writing - review & editing. Shuzhi Sam Ge: Writing - review & editing, Supervision, Investigation. Jing Chen: Data curation.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The work is supported by National Natural Science Foundation of China (61573114). This work is also supported by College of Intelligent Systems Science and Engineering, Harbin Engineering University.
References (57)
- L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, L. Van Gool, Pose guided person image generation, in Advances in...
- et al., MoCoGAN: Decomposing motion and content for video generation
- P. Zablotskaia, A. Siarohin, B. Zhao, L. Sigal, DwNet: Dense warp-based network for pose-guided human video...
- et al., Scripted video generation with a bottom-up generative adversarial network, IEEE Trans. Image Process. (2020)
- et al., Towards multi-pose guided virtual try-on network
- et al., Towards photo-realistic virtual try-on by adaptively generating-preserving image content
- et al., Unpaired person image generation with semantic parsing transformation, IEEE Trans. Pattern Anal. Mach. Intell. (2020)
- et al., Progressive pose attention transfer for person image generation
- et al., Segmentation mask-guided person image generation, Appl. Intell. (2021)
- et al., Appearance and pose-conditioned human image generation using deformable GANs, IEEE Trans. Pattern Anal. Mach. Intell. (2019)
- Disentangled person image generation
- ADCM: Attention dropout convolutional module, Neurocomputing
- Towards fine-grained human pose transfer with detail replenishing network, IEEE Trans. Image Process.
- Least squares generative adversarial networks
- A variational U-Net for conditional appearance and shape generation
- Unsupervised person image synthesis in arbitrary poses
- Image-to-image translation with conditional adversarial networks
- Person image synthesis through siamese generative adversarial network, Neurocomputing
- PMAN: Progressive multi-attention network for human pose transfer, IEEE Trans. Circuits Syst. Video Technol.
- PoNA: Pose-guided non-local attention for human pose transfer, IEEE Trans. Image Process.
- Attention-based fusion for multi-source human image generation
- Deep image spatial transformation for person image generation
- Focus and retain: Complement the broken pose in human image synthesis
- Soft-gated warping-GAN for pose-guided person image synthesis
- VITON: An image-based virtual try-on network
- Learning hierarchical semantic image manipulation through structured representations
- Unsupervised person image generation with semantic parsing transformation
Meichen Liu is pursuing the Ph.D. degree at Harbin Engineering University, China. She is currently an exchange Ph.D. student in the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, supported by the College of Intelligent Systems Science and Engineering, Harbin Engineering University. Her current research interests include Machine Learning and Deep Neural Networks, especially Generative Adversarial Networks and Person Re-identification.
Kejun Wang received the B.E. degree in automation from the Northeast China Heavy Machinery Institute, China, in 1984, the M.E. degree in automation control theory and application from Harbin Engineering University, China, in 1987, and the Ph.D. degree in ship and marine special auxiliary devices and systems from Harbin Engineering University, China, in 1995. He is a professor with the College of Intelligent Systems Science and Engineering at Harbin Engineering University. His research areas include biometrics and intelligent monitoring technology (vein, fingerprint, and gait recognition, moving target detection and tracking), computational bioinformatics, and fitness Internet of Things technology. He has received research funding from agencies including the National Science Foundation of China, the National High Technology Research and Development Program of China, and the Heilongjiang Province Outstanding Youth Science Fund. He has published about 80 conference and journal papers and holds 6 invention patents.
Ruihang Ji received the B.S. degree in automation engineering from the Harbin Institute of Technology, Harbin, China, in 2016, where he is currently pursuing the Ph.D. degree in control science and engineering. He is currently an exchange Ph.D. student supported by the China Scholarship Council with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. His current research interests include finite-time control, unmanned aerial vehicles, and computer vision.
Shuzhi Sam Ge received the B.Sc. degree from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 1986 and the Ph.D. degree from Imperial College London, London, U.K., in 1993. He is the Director of the Social Robotics Laboratory of the Interactive Digital Media Institute, Singapore, and the Centre for Robotics, Chengdu, China, and a Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, on leave from the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu. He has co-authored four books and over 300 international journal and conference papers. His current research interests include social robotics, adaptive control, intelligent systems, and artificial intelligence. Dr. Ge is the Editor-in-Chief of the International Journal of Social Robotics (Springer). He has served or is serving as an Associate Editor for a number of flagship journals, including the IEEE Transactions on Automatic Control, the IEEE Transactions on Control Systems Technology, the IEEE Transactions on Neural Networks, and Automatica. He served as the Vice President for Technical Activities from 2009 to 2010 and for Membership Activities from 2011 to 2012, and as a member of the Board of Governors from 2007 to 2009 at the IEEE Control Systems Society. He is a fellow of the International Federation of Automatic Control, the Institution of Engineering and Technology, and the Society of Automotive Engineers.
Jing Chen received the B.E. degree in automation from Yanshan University, Qinhuangdao, China, in 2016. She is currently pursuing the Ph.D. degree in the College of Intelligent Systems Science and Engineering at Harbin Engineering University. Her research interests include Computer Vision, Machine Learning, and Deep Neural Networks, especially multimodal emotion analysis.
- ☆
This document is the result of the research project funded by the National Science Foundation.