Abstract
Image-based virtual try-on, which aims to virtually superimpose garments onto portrait images, has recently garnered increasing attention in online apparel e-commerce. It generally consists of two steps: alignment and generation. In this paper, we focus on Image-based Virtual Try-on Alignment (IVTA), which plays a pivotal role in virtual try-on by aligning the target garment with the portrait. Current approaches to IVTA mostly adopt Convolutional Neural Networks (CNNs) to extract local detailed features of garments and portraits, ignoring the significance of global features. To address this problem, we propose a novel model named Try-On Aligning Conformer (TOAC) that effectively aligns the target garment with the portrait and thereby improves virtual try-on. First, we integrate a Swin Transformer and a CNN to comprehensively extract both global patterns and local details. Second, we propose a robust learned perceptual loss between generatively reconstructed garment images and the ground truth to alleviate the overlap problem. Extensive experiments demonstrate the superiority of our proposed model over state-of-the-art methods for virtual try-on alignment.
Supported by National Natural Science Foundation of China (No. 62036012, 62276257).
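The abstract's core idea, pairing a CNN branch for local detail with a transformer branch for global context, can be illustrated with a minimal NumPy sketch. This is not the authors' TOAC implementation (which uses a Swin Transformer and learned weights); the function names and the tiny fixed kernel are hypothetical stand-ins showing how a local convolution and a global self-attention step over the same feature map can be fused per position.

```python
import numpy as np

def local_branch(x, kernel):
    # x: (H, W) feature map; a 3x3 convolution captures local detail (the CNN role)
    H, W = x.shape
    pad = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * kernel)
    return out

def global_branch(x):
    # Flatten positions to tokens and apply one self-attention step:
    # every position attends to every other, capturing global patterns
    # (the transformer role).
    tokens = x.reshape(-1, 1)                       # (H*W, 1) tokens
    scores = tokens @ tokens.T                      # pairwise similarity
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over positions
    return (attn @ tokens).reshape(x.shape)

def fuse(x, kernel):
    # Schematic fusion: each spatial position carries one local
    # and one global feature channel.
    return np.stack([local_branch(x, kernel), global_branch(x)], axis=-1)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
fused = fuse(feat, np.ones((3, 3)) / 9.0)
print(fused.shape)  # (8, 8, 2)
```

In the actual model, both branches would be deep, learned feature extractors and the fusion would feed the alignment (warping) module; the sketch only conveys why combining the two receptive-field regimes gives each position both fine detail and scene-wide context.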
Copyright information
Ā© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, Y., Xiang, W., Zhang, S., Xue, D., Qian, S. (2024). TOAC: Try-On Aligning Conformer for Image-Based Virtual Try-On Alignment. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds.) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol. 14474. Springer, Singapore. https://doi.org/10.1007/978-981-99-9119-8_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9118-1
Online ISBN: 978-981-99-9119-8