Abstract
In this paper, we propose Occlusion-Aware Warping GAN (OAW-GAN), a unified Human Video Synthesis (HVS) framework that can uniformly tackle human video motion transfer, attribute editing, and inpainting. To our knowledge, this is the first work that handles all these tasks within a single once-trained model. Although existing GAN-based HVS methods have achieved great success, they either cannot preserve appearance details, owing to the loss of spatial consistency between the synthesized target frames and the input source images, or generate incoherent video results, owing to the loss of temporal consistency across frames. Moreover, most of them cannot create new content while keeping existing content, and fail especially when regions of the target are invisible in the source due to self-occlusion. To address these limitations, we first introduce a Coarse-to-Fine Flow Warping Network (C2F-FWN) that estimates a spatio-temporally consistent transformation between source and target, together with an occlusion mask indicating which parts of the target are invisible in the source. The flow and the mask are then scaled and fed into the pyramidal stages of our OAW-GAN, guiding Occlusion-Aware Synthesis (OAS), which can be abstracted as visible-part re-utilization and invisible-part inpainting at the feature level and effectively alleviates the self-occlusion problem. Extensive experiments conducted on both human video (i.e., iPER, SoloDance) and image (i.e., DeepFashion) datasets demonstrate the superiority of our approach over existing state-of-the-art methods. We also show that, beyond the motion transfer task that previous works focus on, our framework further achieves attribute editing and texture inpainting, which paves the way towards unified HVS.
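The occlusion-aware synthesis described above can be sketched as a warp-then-compose step: warp source features to the target pose with the estimated flow, then reuse the warped features where the occlusion mask marks the target as visible and fall back to generated features where it does not. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, warping uses nearest-neighbour sampling instead of learned bilinear warping, and the composition is shown at a single scale rather than across the pyramidal stages.

```python
import numpy as np

def warp_with_flow(feat, flow):
    """Backward-warp a feature map (H, W, C) with a dense flow field (H, W, 2).

    Nearest-neighbour sampling for simplicity; sample coordinates are
    clamped to the image borders. Illustrative only.
    """
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return feat[src_y, src_x]

def occlusion_aware_synthesis(warped_src_feat, generated_feat, occ_mask):
    """Compose visible and invisible parts at the feature level.

    occ_mask (H, W) is 1 where the target region is visible in the source
    (re-use the warped source feature) and 0 where it is occluded
    (fall back to the generated, i.e. inpainted, feature).
    """
    m = occ_mask[..., None]  # broadcast mask over the channel axis
    return m * warped_src_feat + (1.0 - m) * generated_feat
```

In the full model this blend would happen inside each pyramidal stage of the generator, with the flow and mask downscaled to match each stage's resolution.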
Wei, D., Huang, K., Ma, L. et al. OAW-GAN: occlusion-aware warping GAN for unified human video synthesis. Appl Intell 53, 616–633 (2023). https://doi.org/10.1007/s10489-022-03527-y