ABSTRACT
Video virtual try-on aims to generate coherent, smooth, and realistic try-on videos by matching the target clothing to the person in the video in a spatiotemporally consistent manner. Existing methods can match the human body with the clothing and render the result as video, but the final output often suffers from excessive grid distortion and poor visual quality. We trace this problem to two causes. First, neglecting the relationship between the inputs leads to the loss of some features, while conventional convolution struggles to model the long-range dependencies that are crucial for globally consistent results. Second, weak constraints on clothing texture detail allow excessive deformation during TPS fitting, so many try-on methods produce unrealistic final videos. To address these problems, we reduce excessive garment distortion during deformation by regularizing the TPS parameters with a constraint function, and we propose a multimodal two-stage combinatorial Transformer. In the first stage, an interaction module models the long-range relationship between the person and the clothing, yielding better long-range dependencies and improving the TPS fit. In the second stage, an activation module establishes a global dependency that makes the important regions of the input more prominent, providing more natural intermediate inputs for the subsequent U-Net. The proposed method improves video virtual try-on, and experiments on the VVT dataset show that it outperforms previous methods both quantitatively and qualitatively.
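The idea of regularizing TPS parameters to limit garment distortion can be illustrated with a minimal sketch. The paper's exact constraint function is not given in the abstract, so the code below uses a hypothetical but common second-order smoothness term: it penalizes non-uniform spacing between neighboring TPS control points, which is what allows the warping grid to distort excessively.

```python
import numpy as np

def tps_constraint(ctrl_pts: np.ndarray) -> float:
    """Second-order smoothness penalty on a grid of TPS control points.

    ctrl_pts: (H, W, 2) array of predicted control-point coordinates.
    Penalizes how much each spacing between adjacent control points deviates
    from its neighboring spacing; a perfectly uniform grid incurs zero
    penalty, while sharp local distortions are penalized heavily. This is a
    hypothetical stand-in for the paper's constraint function.
    """
    # spacings between horizontally / vertically adjacent control points
    dx = ctrl_pts[:, 1:, :] - ctrl_pts[:, :-1, :]   # (H, W-1, 2)
    dy = ctrl_pts[1:, :, :] - ctrl_pts[:-1, :, :]   # (H-1, W, 2)
    # second differences: change between consecutive spacings
    ddx = dx[:, 1:, :] - dx[:, :-1, :]
    ddy = dy[1:, :, :] - dy[:-1, :, :]
    return float(np.abs(ddx).mean() + np.abs(ddy).mean())
```

Adding such a term to the warping loss trades a small amount of fitting accuracy for far less texture stretching, which matters frame-to-frame in video where distortions flicker visibly.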
MMTrans: MultiModal Transformer for realistic video virtual try-on