DOI: 10.1145/3574131.3574431

MMTrans: MultiModal Transformer for realistic video virtual try-on

Published: 13 January 2023

ABSTRACT

Video virtual try-on methods aim to generate coherent, smooth, and realistic try-on videos by matching the target clothing to the person in the video in a spatiotemporally consistent manner. Existing methods can align the clothing with the human body and present the result as a video, but they often suffer from excessive grid distortion and poor visual quality. We find that this is because the relationships among the inputs are neglected, so some features are lost; conventional convolution also struggles to model the long-range dependencies that are crucial for globally consistent results, and without constraints on clothing texture detail the TPS fitting deforms the garment excessively, making the final rendered videos of many try-on methods look unrealistic. To address these problems, we reduce excessive garment distortion during deformation by regularizing the TPS parameters with a constraint function, and we propose a multimodal two-stage combinatorial Transformer. In the first stage, an interaction module models the long-range relationship between the person and the clothing, yielding better long-range dependencies and improving the TPS fitting. In the second stage, an activation module establishes a global dependency that makes the important regions of the input more prominent, providing more natural intermediate inputs for the subsequent U-Net. Our method produces better results for video virtual try-on, and experiments on the VVT dataset show that it outperforms previous methods both quantitatively and qualitatively.
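As a rough illustration of the ideas summarized above, the sketch below shows a cross-attention "interaction" block, a self-attention "activation" block, and a simple smoothness constraint on TPS control-point offsets. This is not the authors' implementation: the module names, token shapes, dimensions, and the specific penalty are illustrative assumptions in PyTorch.

# Illustrative sketch only (not the paper's released code). It assumes the
# person and clothing features have already been flattened into token sequences.
import torch
import torch.nn as nn


class InteractionBlock(nn.Module):
    # Stage 1 (assumed form): cross-attention lets person tokens attend to
    # clothing tokens, modelling the long-range person-clothing relationship
    # before TPS warping.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, person_tokens, cloth_tokens):
        out, _ = self.attn(person_tokens, cloth_tokens, cloth_tokens)
        return self.norm(person_tokens + out)


class ActivationBlock(nn.Module):
    # Stage 2 (assumed form): self-attention over the fused tokens builds a
    # global dependency that emphasises the regions fed to the U-Net generator.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out)


def tps_constraint(offsets):
    # Hypothetical constraint on TPS control-point offsets of shape (B, H, W, 2):
    # penalising differences between neighbouring control points discourages the
    # excessive garment distortion described in the abstract.
    dx = offsets[:, :, 1:, :] - offsets[:, :, :-1, :]
    dy = offsets[:, 1:, :, :] - offsets[:, :-1, :, :]
    return dx.pow(2).mean() + dy.pow(2).mean()


if __name__ == "__main__":
    person = torch.randn(2, 196, 256)   # tokens from the person representation
    cloth = torch.randn(2, 196, 256)    # tokens from the target clothing
    fused = InteractionBlock()(person, cloth)
    activated = ActivationBlock()(fused)           # intermediate input for the U-Net stage
    reg = tps_constraint(torch.randn(2, 5, 5, 2))  # added to the warping objective
    print(activated.shape, reg.item())

In the actual method the interaction output guides the TPS warping and the activation output feeds the U-Net generator; the exact fusion, constraint function, and loss weighting are not specified here.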


• Published in

  VRCAI '22: Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry
  December 2022
  284 pages
  ISBN: 9798400700316
  DOI: 10.1145/3574131

  Editors:
  • Enhua Wu
  • Lionel Ming-Shuan Ni
  • Zhigeng Pan
  • Daniel Thalmann
  • Ping Li
  • Charlie C.L. Wang
  • Lei Zhu
  • Minghao Yang

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

Overall Acceptance Rate: 51 of 107 submissions, 48%

