Elsevier

Neurocomputing

Volume 453, 17 September 2021, Pages 502-511

Unsupervised cycle-consistent person pose transfer

https://doi.org/10.1016/j.neucom.2020.10.059

Abstract

Person pose transfer, i.e., transferring the pose of a given person to a target pose, is a challenging task due to the complex interplay of appearance, pose, and background. Most previous works adopt a supervised framework and require paired person images with the same identity but different poses, which largely limits their applicability. Besides, the background of the generated image may be altered from that of the original due to over-fitting, which is unfavorable for the pose transfer task. To tackle these problems, we propose an unsupervised cycle-consistent person pose transfer approach. It is trained with unpaired cross-identity person images and preserves the background information well. Compared with previous methods, our approach achieves better results on the cross-identity person pose transfer task and comparable results on the self-identity one. Moreover, our method can serve as an effective data augmentation scheme for person recognition tasks, which is validated by extensive experiments on pedestrian re-identification and detection.

Introduction

This paper focuses on the person pose transfer problem, which is to transfer a person from one pose to another, as depicted in Fig. 1. It has several applications, such as image editing, video generation, and data augmentation. Meanwhile, it is exceptionally challenging due to the drastic appearance difference of the same person in different poses.

Recent years have witnessed several works dedicated to this problem, including supervised methods [1], [2], [3], [4], [5] and unsupervised ones [6], [7], [8], most of which target transferring person pose in self-identity scenarios during both training and testing. However, it is quite difficult to acquire image pairs with different poses of the same identity. Moreover, under the current generative adversarial network based pose transfer framework, pose transfer is conducted by feeding the generative model a given person image concatenated with key-point coordinates, which are obtained via a pre-trained pose estimation model. Since the input pose key-point coordinates come from the same person during training but mostly from different persons in real scenarios, we presume that methods trained only in self-identity scenarios are sub-optimal for real-world person pose transfer applications. Because the key-point coordinates encode not only the pose but also the body shape of a person, existing self-identity person pose transfer methods cannot handle the cross-identity pose transfer task well, which is the most common setting in pose transfer applications.

Compared with self-identity pairs, we can easily acquire plenty of cross-identity training data without identity labels, i.e., person images with various poses but from different identities. Additionally, as shown at the top of Fig. 1, existing supervised methods can introduce unnecessary background changes into the generated images because they force the output of the generator to be similar to the given ground-truth image, whose background sometimes differs from that of the original person image. Considering the aim of the person pose transfer task, this background change is an over-fitting artifact and is undesirable. However, we find there is currently no metric to evaluate the background change after pose transfer.

Accordingly, we propose an unsupervised cycle-consistent person pose transfer framework which makes full use of cross-identity training data and keeps the background unchanged, as illustrated at the bottom of Fig. 1. The target pose is extracted from another person image by a pretrained pose estimation model, which is the OpenPose Human Pose Estimator (HPE) [9] in this paper. In existing supervised methods, this other person image must share the same identity as the input image, while in our approach it can belong to a different identity. This difference dramatically enlarges the number of available target pose images and the diversity of person poses. What's more, our method can even work on raw images without identity labels. During the training phase, we first transfer the source person to the target pose, then conduct the reverse pose transfer to obtain a reconstruction of the input image. By forcing the reconstruction to be consistent with the original image, we avoid background over-fitting in the generated image and preserve the background information of the input image.

Fig. 2 details the training pipeline of the proposed method. To fully exploit the training data, we take two images and their poses as input and exchange their poses twice in each training iteration (first transferring to the other pose and then transferring back) [10], [11]. Meanwhile, we jointly optimize our model under the supervision of four losses during training to achieve the best generation quality. The first is the adversarial loss, which guarantees the visual quality of the generated person images. The second is the cycle-consistent loss, which minimizes the reconstruction error between the input person image and its reconstruction. The third is the identity-consistent loss (Fig. 3), which forces the generator to be an identity mapping when the target pose is the same as the original one. Finally, we propose a novel pose-consistent loss, which forces the pose of the generated person images to be consistent with the target pose and effectively improves pose transfer precision.
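The four losses above can be combined into a single training objective. The following is a minimal NumPy sketch, not the authors' implementation: `G`, `D`, `pose_net`, and the weighting coefficients `lam_*` are stand-ins for the paper's generator, discriminator, pose estimator, and (unstated) loss weights.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two tensors."""
    return float(np.mean(np.abs(a - b)))

def training_losses(G, D, pose_net, x_a, p_a, x_b, p_b,
                    lam_cyc=10.0, lam_idt=5.0, lam_pose=1.0):
    """Generator-side losses for one cycle-consistent training iteration.

    G(x, p)     -> image of the person in x rendered in pose p
    D(x)        -> realism score in (0, 1]
    pose_net(x) -> estimated pose heat map of image x
    The lambda weights are illustrative, not from the paper.
    """
    # Forward transfer: render person a in pose b, then transfer back.
    x_ab = G(x_a, p_b)
    x_aba = G(x_ab, p_a)

    # 1) Adversarial loss: the generated image should look real to D.
    loss_adv = -np.log(D(x_ab) + 1e-8)
    # 2) Cycle-consistent loss: the round trip should reconstruct the input.
    loss_cyc = l1(x_aba, x_a)
    # 3) Identity-consistent loss: same pose in, same image out.
    loss_idt = l1(G(x_a, p_a), x_a)
    # 4) Pose-consistent loss: estimated pose of the output matches the target.
    loss_pose = l1(pose_net(x_ab), p_b)

    return loss_adv + lam_cyc * loss_cyc + lam_idt * loss_idt + lam_pose * loss_pose
```

In practice the symmetric direction (person b rendered in pose a and back) would contribute the same four terms, and the discriminator would be trained with its own adversarial objective.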

In summary, this paper makes the following contributions:

  • We propose an unsupervised cycle-consistent person pose transfer framework. Unlike previous methods, it removes the limitation of paired self-identity training data and can take any person images as training data, and thus has much broader applicability.

  • Our cycle-consistent framework can well preserve the background information of the input images. Besides, we present a novel evaluation criterion “back-SSIM” to assess the background change in the generated images quantitatively.

  • As demonstrated by extensive quantitative and qualitative experimental results, the proposed method achieves state-of-the-art performance on the cross-identity pose transfer task. Even in self-identity scenarios, it obtains results comparable to those of supervised methods.

  • Our method can serve as an effective data augmentation scheme for person recognition tasks, such as pedestrian re-identification and detection.
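The "back-SSIM" criterion in the contributions above is not fully specified in this snippet; a plausible reading of the name is SSIM restricted to background pixels via a person mask. The sketch below uses single-window (global) SSIM statistics on the masked region; the function name, the `person_mask` input, and the constants are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

def back_ssim(original, generated, person_mask, C1=6.5025, C2=58.5225):
    """Hypothetical sketch of a background-only SSIM score.

    original, generated: float arrays in [0, 255], shape (H, W)
    person_mask: boolean array, True where the person is
    C1, C2: standard SSIM stabilizers for 8-bit images
    Returns a value in (-1, 1]; 1.0 means the background is unchanged.
    """
    bg = ~person_mask                       # background = everything off the person
    a = original[bg].astype(np.float64)
    b = generated[bg].astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    den = (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    return num / den
```

A windowed SSIM (as in the original SSIM paper) averaged over background-only windows would be a closer match to standard practice; the global-statistics version is kept here only for brevity.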

Section snippets

Related work

Generative Adversarial Networks (GANs) [12] have made great progress on the image generation problem [12], [13], [14], [10], [11]. Many works have focused on applying GANs to generate images that satisfy certain condition constraints. Conditional Generative Adversarial Networks (CGANs) [15] were built for that purpose and have achieved remarkable success.

Recently, more and more works focus on person image generation. Lassner et al. [16] mapped a 3D model to person images with different clothes

Methodology

We first introduce some notation. {(x_i, p_i)}_{i=1}^{N} denotes the set of person images and their poses in a dataset, where i is the person index, x_i is the image, and p_i is its pose. p_i is an 18-channel heat map that encodes the locations of the 18 key-points of a human body. We follow the practice in [1], [2], [4] and adopt HPE to estimate the 18 key-points.
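The snippet does not detail how the 18 key-points are encoded into an 18-channel heat map. A common choice, shown below as an assumed sketch rather than the paper's exact encoding, is one Gaussian bump per key-point channel; the function name, `sigma`, and the (-1, -1) convention for invisible joints are all illustrative.

```python
import numpy as np

def keypoints_to_heatmap(keypoints, height, width, sigma=6.0):
    """Encode body key-points as a multi-channel heat map p_i.

    keypoints: list of 18 (x, y) tuples; (-1, -1) marks an undetected joint.
    Returns an array of shape (18, height, width) with one Gaussian bump
    per visible joint, peaking at 1.0 at the joint location.
    """
    heatmap = np.zeros((len(keypoints), height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]   # pixel coordinate grids
    for c, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:                 # joint not detected: leave channel empty
            continue
        heatmap[c] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap
```

The resulting tensor can then be concatenated with the input image along the channel axis before being fed to the generator, as described in the introduction.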

Experiments

In this section, we conduct experiments to verify the effectiveness of our proposed framework. First, we compare our method with the state of the art in both self-identity and cross-identity person pose transfer scenarios. The input image and target pose come from the same person in the former situation but can come from different persons in the latter. Then we utilize the proposed method as a data augmentation scheme to improve the performance of person re-identification on Market-1501 [28]. Moreover,

Conclusion

In this paper, we propose an unsupervised cycle-consistent person pose transfer framework which can be trained with unpaired cross-identity data. Meanwhile, our method maintains the background during pose transfer. What's more, it can also be used as an effective data augmentation scheme for person recognition tasks.

CRediT authorship contribution statement

Songyan Liu: Conceptualization, Methodology, Investigation, Writing - original draft. Haiyun Guo: Validation, Writing - review & editing. Kuan Zhu: Validation, Writing - review & editing. Jinqiao Wang: Writing - review & editing. Ming Tang: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by National Natural Science Foundation of China (No. 61772527, 61806200 and 61976210) and China Postdoctoral Science Foundation (No. 2019M660859).

Songyan Liu received the B.E. degree in 2015 from Southeast University, Nanjing, China and the Ph.D. degree on pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China, in 2020. His research interests include the analysis of deep learning networks and the application of generative adversarial networks.

References (43)

  • L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, L.V. Gool, Pose guided person image generation, in: Advances in...
  • A. Siarohin et al., Deformable GANs for pose-based human image generation, in...
  • P. Esser et al., A variational U-Net for conditional appearance and shape generation, in...
  • Z. Zhu et al., Progressive pose attention transfer for person image generation
  • Y. Li, C. Huang, C.C. Loy, Dense intrinsic appearance flow for human pose transfer, in: The IEEE Conference on Computer...
  • L. Ma et al., Disentangled person image generation
  • A. Pumarola et al., Unsupervised person image synthesis in arbitrary poses
  • S. Song et al., Unsupervised person image generation with semantic parsing transformation, in...
  • Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, OpenPose: realtime multi-person 2D pose estimation using Part...
  • M.-Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, in: Advances in Neural Information...
  • H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M.K. Singh, M.-H. Yang, Diverse image-to-image translation via disentangled...
  • I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative...
  • P. Isola et al., Image-to-image translation with conditional adversarial networks
  • J. Zhu et al., Unpaired image-to-image translation using cycle-consistent adversarial networks
  • M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint...
  • C. Lassner et al., A generative model of people in clothing
  • D.J. Rezende et al., Stochastic backpropagation and approximate inference in deep generative models
  • D.P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint...
  • B. Zhao, X. Wu, Z. Cheng, H. Liu, J. Feng, Multi-view image generation from a single-view, arXiv preprint...
  • Z. Zheng et al., Joint discriminative and generative learning for person re-identification, in...
  • O. Ronneberger et al., U-Net: convolutional networks for biomedical image segmentation



    Haiyun Guo received the B.E. degree from Wuhan University in 2013 and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, University of Chinese Academy of Sciences, in 2018. She is currently an Assistant Researcher with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image and video processing, and intelligent video surveillance.

    Kuan Zhu received the B.E. degree from Xiamen University, Xiamen, China, in 2018. He is currently pursuing the M.S. degree in pattern recognition and intelligence systems with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include vehicle and person re-identification, pattern recognition and machine learning, and intelligent video surveillance.

    Jinqiao Wang received the B.E. degree in 2001 from Hebei University of Technology, China, and the M.S. degree in 2004 from Tianjin University, China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2008. He is currently a Professor with Chinese Academy of Sciences. His research interests include pattern recognition and machine learning, image and video processing, mobile multimedia, and intelligent video surveillance.

    Ming Tang received the B.S. degree in computer science and engineering and M.S. degree in artificial intelligence from Zhejiang University, Hangzhou, China, in 1984 and 1987, respectively, and the Ph.D. degree in pattern recognition and intelligent system from the Chinese Academy of Sciences, Beijing, China, in 2002. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include computer vision and machine learning.
