ABSTRACT
Video virtual try-on aims to generate coherent, smooth, and realistic try-on videos by matching the target clothing to the person in the video in a spatiotemporally consistent manner. Existing methods can match the human body with the clothing and render the result as video, but the final output often suffers from excessive grid distortion and poor visual quality. We trace this problem to two causes. First, neglecting the relationship between the inputs leads to the loss of some features, while conventional convolution struggles to model the long-range dependencies that are crucial for globally consistent results. Second, weak constraints on clothing texture detail allow excessive deformation during TPS fitting, so many try-on methods produce unrealistic final videos. To address these problems, we reduce excessive garment distortion during deformation by regularizing the TPS parameters with a constraint function, and we propose a multimodal two-stage combinatorial Transformer. In the first stage, an interaction module models the long-range relationship between the person and the clothing, yielding better long-range dependencies and improving the TPS fit. In the second stage, an activation module establishes a global dependency that makes the important regions of the input more prominent, providing more natural intermediate inputs for the subsequent U-Net. The proposed method improves video virtual try-on, and experiments on the VVT dataset show that it outperforms previous methods both quantitatively and qualitatively.
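The idea of regularizing TPS parameters to limit garment distortion can be illustrated with a minimal sketch. The paper's exact constraint function is not given in the abstract, so the code below uses a hypothetical but common second-order smoothness term: it penalizes non-uniform spacing between neighboring TPS control points, which is what allows the warping grid to distort excessively.

```python
import numpy as np

def tps_constraint(ctrl_pts: np.ndarray) -> float:
    """Second-order smoothness penalty on a grid of TPS control points.

    ctrl_pts: (H, W, 2) array of predicted control-point coordinates.
    Penalizes how much each spacing between adjacent control points deviates
    from its neighboring spacing; a perfectly uniform grid incurs zero
    penalty, while sharp local distortions are penalized heavily. This is a
    hypothetical stand-in for the paper's constraint function.
    """
    # spacings between horizontally / vertically adjacent control points
    dx = ctrl_pts[:, 1:, :] - ctrl_pts[:, :-1, :]   # (H, W-1, 2)
    dy = ctrl_pts[1:, :, :] - ctrl_pts[:-1, :, :]   # (H-1, W, 2)
    # second differences: change between consecutive spacings
    ddx = dx[:, 1:, :] - dx[:, :-1, :]
    ddy = dy[1:, :, :] - dy[:-1, :, :]
    return float(np.abs(ddx).mean() + np.abs(ddy).mean())
```

Adding such a term to the warping loss trades a small amount of fitting accuracy for far less texture stretching, which matters frame-to-frame in video where distortions flicker visibly.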
MMTrans: MultiModal Transformer for realistic video virtual try-on