Neurocomputing, Volume 460, 14 October 2021, Pages 345-359

Person image generation with attention-based injection network

https://doi.org/10.1016/j.neucom.2021.06.077

Abstract

Person image generation is a challenging problem due to content ambiguity and style inconsistency. In this paper, we propose a novel Attention-based Injection Network (AIN) to address this issue. Instead of directly learning the relationship between the source and target images, we decompose the process into two accessible modules, namely the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN captures semantic information, embedding human attributes into the latent space via the semantic layout. PAN enables a natural re-coupling of pose and appearance, selectively integrating features to complete the human pose transformation. Additionally, a semantic layout loss is proposed to focus on the semantic content similarity between the source and generated images. Compared with other methods, our network enforces local texture and style consistency between the source and generated images. Experiments show that superior qualitative and quantitative results are obtained on the Market-1501 and DeepFashion datasets. On the basis of AIN, our network can further perform data augmentation for person re-identification (Re-ID), dramatically improving Re-ID accuracy.

Introduction

Person image generation aims to transfer a person image to an arbitrary pose while retaining the appearance details of the source image. This task, first introduced in [1], has become an emerging popular topic in the computer vision community. It has shown great potential in many applications, such as video generation [2], [3], [4], virtual clothes try-on [5], [6], [7], data augmentation for person-related vision tasks [8], [9], [10], [56], etc. The key challenges of this task include the following two aspects: (i) Limited by the non-rigid deformation structure of the human body, it is difficult to directly transform the spatially misaligned body, particularly when only local observations of the human body are given. (ii) Preserving clothing attributes, including texture and style, imposes an unexpected difficulty on the generation process.

Notable results of deep learning have provided significant tools for the pose transfer task [11], [12], [13], [57]. Early studies in [10], [14], [15], [16] directly adopt a global predictive strategy to transfer pose by utilizing the U-Net [17] structure to propagate low-level features. However, considering the non-rigid deformation structure of the human body, U-Net-based global methods often fail to address the spatial misalignment between the source and the target pose. The generated image consequently suffers from detail deficiency, which typically results in over-smoothed clothes. To improve performance, more and more research is devoted to better modeling body deformation and local feature transfer. Some methods in [8], [18], [19], [20] adopt a pose attention network to perform a local transfer that fits the target body topology. Others in [21], [22], [23], [24] fuse the feature representations of appearance and pose by controlling an attention-based decoder. Although the aforementioned approaches achieve better performance, the generated images still suffer from quality problems such as missing texture details and blurry boundaries. This is because these models rely on the extracted features for detail reconstruction, and each intermediate image representation struggles to reveal the regions to be transferred. Besides, the approaches in [8], [18], [19], [20], [21], [22], [23], [24] only take the source image and target pose as inputs, so clothing textures and human outline information may be ignored. It should be noted that preserving semantic information and replenishing fine-grained appearance details is crucial from the viewpoint of human visual perception.

To preserve the appearance and shape simultaneously, the semantic parsing map in [25], [26], [27], [28], [29], [30], [31] has received considerable attention as an intermediate representation between the source and synthesized images. Different from the key-point based pose representation, the semantic parsing map automatically provides a foreground mask and captures valuable prior information for person image generation. However, the generators in [25], [26], [27], [28], [29], [30], [31] directly concatenate the source image, semantic parsing map, and target pose as inputs to a basic U-Net, and then attempt to learn a mapping from the concatenated conditions to the target image in a forced manner. This inevitably makes it difficult to directly transform the spatially misaligned human body due to its inherent non-rigid nature.

Motivated by the aforementioned discussion, we propose a novel Attention-based Injection Network (AIN) to address the above challenging problems. Our generator is decomposed into two interconnected attention networks, called the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN is designed for semantic encoding, and PAN is constructed to enable a natural re-coupling of pose and appearance. Specifically, our model first utilizes SAN to automatically capture clothing attributes from the source image via semantic layouts. Then, the component clothing information, represented by the semantic code, is injected into each designed block of PAN. Inside each block of PAN, we create an attention mechanism to infer the image regions of interest based on the target pose, and integrate the appearance and pose information with the help of an Adaptive Instance Normalization (AdaIN) layer in affine transformation form. Each intermediate pose representation is thus able to better guide the image generation and substantially retain texture details. The desired image can be reconstructed following the target features. Our approach renders textures from the source image into the synthesized map by learning a feature-level mapping, and deals well with information missing and self-occlusion problems. Moreover, the proposed network helps alleviate ambiguities in inferring unobserved pixels. Experiments on the Market-1501 and DeepFashion datasets show that our network achieves superior performance on both qualitative and quantitative results.
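To make the injection mechanism concrete, the following PyTorch-style sketch shows how a semantic code (as produced by an encoder such as SAN) could be injected into one decoder block of PAN through an AdaIN layer, with a learned attention mask selecting the regions to be transferred. All class names, shapes, and the exact block layout here are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of one injection block (assumed layout, not the paper's code).
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: scale and shift the normalized
    features with affine parameters predicted from the semantic code."""
    def __init__(self, code_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.affine = nn.Linear(code_dim, num_features * 2)

    def forward(self, x, code):
        gamma, beta = self.affine(code).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

class InjectionBlock(nn.Module):
    """One block: infer regions of interest from the pose features, then
    selectively integrate the AdaIN-injected appearance information."""
    def __init__(self, channels, code_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.Conv2d(channels, 1, 1)        # per-pixel attention mask
        self.adain = AdaIN(code_dim, channels)

    def forward(self, pose_feat, sem_code):
        h = self.adain(self.conv(pose_feat), sem_code)
        mask = torch.sigmoid(self.attn(h))            # regions of interest
        return mask * h + (1 - mask) * pose_feat      # selective integration

# Example: inject a 128-d semantic code into 256-channel pose features.
block = InjectionBlock(256, 128)
out = block(torch.randn(1, 256, 64, 64), torch.randn(1, 128))
```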

Existing methods in [13], [16], [22], [30], [32] adopt a style loss to enforce the textures of the target and generated images to be similar around the corresponding pose joints. However, person images under different poses can differ drastically in appearance across views, which inevitably demands that the textures around the joints change with the pose. Therefore, a new semantic layout loss is proposed to focus on the semantic content similarity between the source and generated images (a minimal implementation sketch is given after the contribution list below). The representation of the semantic layout loss is captured by the correlation among the different channels of the image. We first separate component attributes from the person image via semantic layouts. The loss can then be implemented by computing the Gram matrix [33] for each patch. Our loss not only alleviates the influence of the background at the pixel level, but also enforces local texture and style consistency between the source and generated images. In summary, the main contributions of this paper are as follows:

  • A novel Attention-based Injection Network (AIN) is proposed to tackle the content ambiguity and style inconsistency challenges of the pose transfer task. Our method is able to preserve and replenish fine-grained semantic and appearance details. It helps alleviate ambiguities in inferring unobserved pixels and handles self-occlusion problems. Experiments demonstrate the rationality and effectiveness of the proposed method.

  • We create two attention networks, called the Semantic-guided Attention Network (SAN) and the Pose-guided Attention Network (PAN). SAN is designed to automatically capture semantic information from the source image via semantic layouts. By effectively selecting and deforming important regions of the image code, PAN enables a natural re-coupling of the image and pose.

  • A new semantic layout loss is designed to deal with the spatial misalignment between the source and target images. Our loss enforces component-level content and style consistency while well retaining texture details and attributes.

  • Our method exhibits superior performance on benchmarks in both preserving body shape and keeping clothing textures. Experiments show that it can enrich pose variations and improve person re-identification accuracy.
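As referenced above, the sketch below illustrates one plausible implementation of the semantic layout loss: feature maps are masked per semantic region, and the Gram matrices [33] of corresponding regions are compared between the source and generated images. The function names and the one-hot mask format are our assumptions, and the masks are assumed to be already resized to the feature resolution.

```python
# Hypothetical sketch of a semantic layout loss: texture/style statistics
# (Gram matrices) are matched per semantic region rather than around pose
# joints, so the background is excluded and each clothing component is
# compared with its counterpart.
import torch
import torch.nn.functional as F

def gram(feat):
    """Channel-wise correlation (Gram) matrix of a feature map."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def semantic_layout_loss(feat_src, feat_gen, masks_src, masks_gen):
    """feat_*: (B, C, H, W) features; masks_*: (B, K, H, W) one-hot
    semantic layouts with K regions, resized to the feature resolution."""
    loss = 0.0
    num_regions = masks_src.shape[1]
    for k in range(num_regions):
        m_s = masks_src[:, k:k + 1]   # (B, 1, H, W), broadcasts over channels
        m_g = masks_gen[:, k:k + 1]
        loss = loss + F.mse_loss(gram(feat_src * m_s), gram(feat_gen * m_g))
    return loss / num_regions
```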

The remainder of the paper is organized as follows. Section 2 reviews related work on person image generation. Section 3 introduces the proposed method in detail. Experimental results and extensive analysis on two datasets (Market-1501 [34] and DeepFashion [35]) are presented in Section 4. The application and conclusion are given in Section 5 and Section 6, respectively.

Section snippets

Pose-guided person image generation

An early attempt at pose transfer is presented by Ma et al. [1], which coarsely generates an image under the target pose in Stage-I and refines the appearance details in Stage-II. To further improve the appearance and shape of generated images, Ma et al. [11] disentangle the source image into appearance and pose, followed by encoding them into embedding features. A similar method is adopted in [15], which can disentangle appearance and pose of the source image by VAE [36] and

The proposed method

Given a target pose Pt and a source image Ips under the pose Ps, our goal is to generate an output image I^pt that follows the clothing appearance of Ips under the target pose Pt. In this paper, we design a novel end-to-end Attention-based Injection Network (AIN) to address this challenging task. The overall framework is shown in Fig. 1, where the inputs of our network are the source image Ips, the corresponding semantic map Sps, and the target pose Pt. We feed the inputs into AIN to capture
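For clarity, the interface implied by this formulation can be summarized in the hypothetical sketch below; the class and argument names are ours, and the internals of SAN and PAN are abstracted away.

```python
# Illustrative interface only: AIN consumes the source image I_ps, its
# semantic map S_ps, and the target pose P_t, and returns the synthesized
# image I^_pt. Names are assumptions for exposition.
import torch.nn as nn

class AIN(nn.Module):
    def __init__(self, san: nn.Module, pan: nn.Module):
        super().__init__()
        self.san = san   # Semantic-guided Attention Network (semantic encoding)
        self.pan = pan   # Pose-guided Attention Network (pose/appearance re-coupling)

    def forward(self, img_src, sem_src, pose_tgt):
        sem_code = self.san(img_src, sem_src)   # semantic/appearance code
        return self.pan(pose_tgt, sem_code)     # generated image I^_pt
```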

Experiments

In this section, we describe the datasets and metrics (Section 4.1), followed by the implementation details of the proposed method (Section 4.2). We then conduct extensive experiments to verify the design rationality and effectiveness of the proposed network (Section 4.3 and Section 4.4). The experimental results demonstrate the superiority of our method in both objective quantitative scores and subjective visual realness.

Application to person re-identification

An effective person pose transfer network can synthesize realistic-looking human images to augment the datasets of person-related vision tasks, which could improve network performance, especially in situations with limited pose variations and insufficient training data. Person re-identification (Re-ID), which aims to match a person across non-overlapping video cameras, has seen a boost in applications and has been an active research field in computer vision. Like the more recent method in

Conclusion

In this paper, we propose a novel Attention-based Injection Network for person image generation. To address the complexity of directly learning the mapping, the pose transfer process is decomposed into two accessible modules, SAN and PAN. SAN is able to extract semantic information automatically from the source image via the human layout. PAN enables a natural re-coupling of pose and appearance. Moreover, a new semantic layout loss is proposed to well retain the semantic content,

CRediT authorship contribution statement

Meichen Liu: Conceptualization, Methodology, Software, Validation, Formal analysis. Kejun Wang: Supervision, Writing - original draft, Writing - review & editing, Formal analysis, Funding acquisition. Ruihang Ji: Software, Validation, Writing - review & editing. Shuzhi Sam Ge: Writing - review & editing, Supervision, Investigation. Jing Chen: Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61573114) and by the College of Intelligent Systems Science and Engineering, Harbin Engineering University.

References (57)

  • L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, L. Van Gool, Pose guided person image generation, in: Advances in...
  • S. Tulyakov et al., MoCoGAN: Decomposing motion and content for video generation.
  • P. Zablotskaia, A. Siarohin, B. Zhao, L. Sigal, DwNet: Dense warp-based network for pose-guided human video...
  • Q. Chen et al., Scripted video generation with a bottom-up generative adversarial network, IEEE Trans. Image Process. (2020).
  • H. Dong et al., Towards multi-pose guided virtual try-on network.
  • H. Yang et al., Towards photo-realistic virtual try-on by adaptively generating-preserving image content.
  • S. Song et al., Unpaired person image generation with semantic parsing transformation, IEEE Trans. Pattern Anal. Mach. Intell. (2020).
  • Z. Zhu et al., Progressive pose attention transfer for person image generation.
  • M. Liu et al., Segmentation mask-guided person image generation, Appl. Intell. (2021).
  • A. Siarohin et al., Appearance and pose-conditioned human image generation using deformable GANs, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
  • L. Ma et al., Disentangled person image generation.
  • Z. Liu et al., ADCM: Attention dropout convolutional module, Neurocomputing (2020).
  • L. Yang et al., Towards fine-grained human pose transfer with detail replenishing network, IEEE Trans. Image Process. (2021).
  • X. Mao et al., Least squares generative adversarial networks.
  • P. Esser et al., A variational U-Net for conditional appearance and shape generation.
  • A. Pumarola et al., Unsupervised person image synthesis in arbitrary poses.
  • P. Isola et al., Image-to-image translation with conditional adversarial networks.
  • Y. Chen et al., Person image synthesis through siamese generative adversarial network, Neurocomputing (2020).
  • B. Chen et al., PMAN: Progressive multi-attention network for human pose transfer, IEEE Trans. Circuits Syst. Video Technol. (2021).
  • K. Li et al., PoNA: Pose-guided non-local attention for human pose transfer, IEEE Trans. Image Process. (2020).
  • S. Lathuilière et al., Attention-based fusion for multi-source human image generation.
  • Y. Ren et al., Deep image spatial transformation for person image generation.
  • S. Huang, H. Xiong, Z.-Q. Cheng, Q. Wang, X. Zhou, B. Wen, J. Huan, D. Dou, Generating person images with...
  • P. Ge et al., Focus and retain: Complement the broken pose in human image synthesis.
  • H. Dong et al., Soft-gated warping-GAN for pose-guided person image synthesis.
  • X. Han et al., VITON: An image-based virtual try-on network.
  • S. Hong et al., Learning hierarchical semantic image manipulation through structured representations.
  • S. Song et al., Unsupervised person image generation with semantic parsing transformation.


Meichen Liu is pursuing the Ph.D. degree at Harbin Engineering University, China. She is currently an exchange Ph.D. student in the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, supported by the College of Intelligent Systems Science and Engineering, Harbin Engineering University. Her current research interests include machine learning and deep neural networks, especially generative adversarial networks and person re-identification.

Kejun Wang received the B.E. degree in automation from the Northeast China Heavy Machinery Institute, China, in 1984, the M.E. degree in automation control theory and application from Harbin Engineering University, China, in 1987, and the Ph.D. degree in ship and marine special auxiliary devices and systems from Harbin Engineering University, China, in 1995. He is a professor in the College of Intelligent Systems Science and Engineering at Harbin Engineering University. He does research in the areas of biometrics and intelligent monitoring technology (vein, fingerprint, and gait recognition; moving-target detection and tracking), computational bioinformatics, and fitness Internet of Things technology. He has received research funding from agencies including the National Science Foundation, the National High Technology Research and Development Program of China, and the Heilongjiang Province Outstanding Youth Science Fund. He has published about 80 conference and journal papers and holds 6 invention patents.

Ruihang Ji received the B.S. degree in automation engineering from the Harbin Institute of Technology, Harbin, China, in 2016, where he is currently pursuing the Ph.D. degree in control science and engineering. He is currently an exchange Ph.D. student supported by the China Scholarship Council with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. His current research interests include finite-time control, unmanned aerial vehicles, and computer vision.

Shuzhi Sam Ge received the B.Sc. degree from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 1986 and the Ph.D. degree from Imperial College London, London, U.K., in 1993. He is the Director of the Social Robotics Laboratory of the Interactive Digital Media Institute, Singapore, and the Centre for Robotics, Chengdu, China, and a Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, on leave from the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu. He has co-authored four books and over 300 international journal and conference papers. His current research interests include social robotics, adaptive control, intelligent systems, and artificial intelligence. Dr. Ge is the Editor-in-Chief of the International Journal of Social Robotics (Springer). He has served/been serving as an Associate Editor for a number of flagship journals, including the IEEE Transactions on Automatic Control, the IEEE Transactions on Control Systems Technology, the IEEE Transactions on Neural Networks, and Automatica. He served as the Vice President for Technical Activities from 2009 to 2010 and for Membership Activities from 2011 to 2012, and as a member of the Board of Governors from 2007 to 2009 at the IEEE Control Systems Society. He is a fellow of the International Federation of Automatic Control, the Institution of Engineering and Technology, and the Society of Automotive Engineering.

Jing Chen received the B.E. degree in automation from Yanshan University, Qinhuangdao, China, in 2016. She is currently pursuing the Ph.D. degree in the College of Intelligent Systems Science and Engineering at Harbin Engineering University. Her research interests include computer vision, machine learning, and deep neural networks, especially multimodal emotion analysis.

This document is the result of the research project funded by the National Science Foundation.
