Abstract
In this paper, we propose a Landmark Guided Virtual Try-On (LGVTON) method for clothes, which aims to solve the problem of clothing trials on e-commerce websites. Given the images of two people, a person and a model, it generates a rendition of the person wearing the clothes of the model. This is useful considering the fact that on most e-commerce websites, images of the clothes alone are usually not available. We follow a three-stage approach to achieve our objective. In the first stage, LGVTON warps the clothes of the model using a Thin-Plate Spline (TPS) based transformation to fit the person. Unlike previous TPS-based methods, we use the landmarks (of human and clothes) to compute the TPS transformation. This enables the warping to work independently of the complex patterns, such as stripes, florals, and textures, present on the clothes. However, the computed warp may not always be precise. We therefore refine it in the subsequent stages with the help of a mask generator (Stage 2) and an image synthesizer (Stage 3) module. The mask generator improves the fit of the warped clothes, and the image synthesizer ensures a realistic output. To tackle the lack of paired training data, we resort to a self-supervised training strategy. Here, paired data refers to a pair of images of a model and a person wearing the same clothes. We compare LGVTON with four existing methods on two popular fashion datasets, namely MPV and DeepFashion, using two performance measures, FID (Fréchet Inception Distance) and SSIM (Structural Similarity Index). In most cases, the proposed method outperforms the state-of-the-art methods.
Data Availability Statement
The datasets used in this work are available in public domains. The sources are appropriately referred to.
Code availability
Will be made available once the work is published.
Notes
This is elaborated further in the ablation study given in the Appendix.
A more detailed discussion on the correlation layer is given in the Appendix.
A detailed study on TPS is given in the Appendix.
A detailed ablation study of MGM is given in the Appendix.
This is discussed in more detail in the Appendix.
More results of LGVTON are given in the Appendix.
Please see the Appendix for a detailed study on the utility of each component of LGVTON.
References
Alp Güler R, Neverova N, Kokkinos I (2018) Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7297–7306
Belongie S, Malik J, Puzicha J (2001) Shape context: A new descriptor for shape matching and object recognition. In: Advances in neural information processing systems, pp 831–837
Bogo F, Kanazawa A, Lassner C, Gehler P, Romero J, Black MJ (2016) Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: European Conference on Computer Vision, Springer, pp 561–578
Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Donato G, Belongie S (2002) Approximate thin plate spline mappings. In: European conference on computer vision, Springer, pp 21–31
Dong H, Liang X, Shen X, Wang B, Lai H, Zhu J, Hu Z, Yin J (2019a) Towards multi-pose guided virtual try-on network. In: Proceedings of the IEEE International Conference on Computer Vision, pp 9026–9035
Dong H, Liang X, Shen X, Wu B, Chen BC, Yin J (2019b) Fw-gan: Flow-navigated warping gan for video virtual try-on. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1161–1170
Duchon J (1977) Splines minimizing rotation-invariant semi-norms in sobolev spaces. In: Constructive theory of functions of several variables, Springer, pp 85–100
Gong K, Liang X, Zhang D, Shen X, Lin L (2017) Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 932–940
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
Han X, Wu Z, Wu Z, Yu R, Davis LS (2018) Viton: An image-based virtual try-on network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7543–7552
Han X, Hu X, Huang W, Scott MR (2019) Clothflow: A flow-based model for clothed person generation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 10471–10480
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems, pp 6626–6637
Hsieh CW, Chen CY, Chou CL, Shuai HH, Cheng WH (2019a) Fit-me: Image-based virtual try-on with arbitrary poses. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, pp 4694–4698
Hsieh CW, Chen CY, Chou CL, Shuai HH, Liu J, Cheng WH (2019b) Fashionon: Semantic-guided image-based virtual try-on with detailed human and clothing information. In: Proceedings of the 27th ACM International Conference on Multimedia, pp 275–283
Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1125–1134
Issenhuth T, Mary J, Calauzènes C (2019) End-to-end learning of geometric deformations of feature maps for virtual try-on. arXiv preprint arXiv:1906.01347
Jae Lee H, Lee R, Kang M, Cho M, Park G (2019) La-viton: A network for looking-attractive virtual try-on. In: Proceedings of the IEEE International Conference on Computer Vision Workshops
Jandial S, Chopra A, Ayush K, Hemani M, Krishnamurthy B, Halwai A (2020) Sievenet: A unified framework for robust image-based virtual try-on. In: The IEEE Winter Conference on Applications of Computer Vision, pp 2182–2190
Jetchev N, Bergmann U (2017) The conditional analogy gan: Swapping fashion articles on people images. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2287–2292
Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision, Springer, pp 694–711
Liu Z, Luo P, Qiu S, Wang X, Tang X (2016a) Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104
Liu Z, Yan S, Luo P, Wang X, Tang X (2016b) Fashion landmark detection in the wild. In: European Conference on Computer Vision, Springer, pp 229–245
Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: A review. Expert Systems with Applications 91:480–491
Maintz JA, Viergever MA (1998) A survey of medical image registration. Medical image analysis 2(1):1–36
Minar MR, Ahn H (2020) Cloth-vton: Clothing three-dimensional reconstruction for hybrid image-based virtual try-on. In: Proceedings of the Asian Conference on Computer Vision
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784
Neuberger A, Borenstein E, Hilleli B, Oks E, Alpert S (2020) Image based virtual try-on network from unpaired data. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: European conference on computer vision, Springer, pp 483–499
Pons-Moll G, Pujades S, Hu S, Black MJ (2017) Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG) 36(4):73
Raj A, Sangkloy P, Chang H, Hays J, Ceylan D, Lu J (2018) Swapnet: Image based garment transfer. In: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XII, pp 679–695
Rocco I, Arandjelovic R, Sivic J (2017) Convolutional neural network architecture for geometric matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6148–6157
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. In: Advances in neural information processing systems, pp 2234–2242
Sekine M, Sugita K, Perbet F, Stenger B, Nishiyama M (2014) Virtual fitting by single-shot body shape estimation. In: Int. Conf. on 3D Body Scanning Technologies, Citeseer, pp 406–413
Tirosh S, Van De Ville D, Unser M (2006) Polyharmonic smoothing splines and the multidimensional Wiener filtering of fractal-like signals. IEEE Transactions on Image Processing
Shigeki Y, Okura F, Mitsugami I, Yagi Y (2018) Estimating 3d human shape under clothing from a single rgb image. IPSJ Transactions on Computer Vision and Applications 10(1):16
Song D, Li T, Mao Z, Liu AA (2019) Sp-viton: shape-preserving image-based virtual try-on network. Multimedia Tools and Applications pp 1–13
Sprengel R, Rohr K, Stiehl HS (1996) Thin-plate spline approximation for image registration. In: Proceedings of 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, vol 3, pp 1190–1191
Sun F, Guo J, Su Z, Gao C (2019) Image-based virtual try-on network with structural coherence. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, pp 519–523
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
Wahba G (1990) Spline models for observational data, vol 59. SIAM
Wang B, Zheng H, Liang X, Chen Y, Lin L, Yang M (2018a) Toward characteristic-preserving image-based virtual try-on network. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 589–604
Wang W, Xu Y, Shen J, Zhu SC (2018b) Attentive fashion grammar network for fashion landmark detection and clothing category classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4271–4280
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4):600–612
Wu Z, Lin G, Tao Q, Cai J (2019) M2e-try on net: Fashion from model to everyone. In: Proceedings of the 27th ACM International Conference on Multimedia, ACM, pp 293–301
Xian W, Sangkloy P, Agrawal V, Raj A, Lu J, Fang C, Yu F, Hays J (2018) Texturegan: Controlling deep image synthesis with texture patches. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8456–8465
Yang H, Zhang R, Guo X, Liu W, Zuo W, Luo P (2020) Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yu R, Wang X, Xie X (2019) Vtnfp: An image-based virtual try-on network with body and clothing feature preservation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 10511–10520
Zanfir M, Popa AI, Zanfir A, Sminchisescu C (2018) Human appearance transfer. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Zeng W, Zhao M, Gao Y, Zhang Z (2020) Tilegan: category-oriented attention-based high-quality tiled clothes generation from dressed person. Neural Computing and Applications
Zhao R, Ouyang W, Wang X (2013) Unsupervised salience learning for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3586–3593
Zheng N, Song X, Chen Z, Hu L, Cao D, Nie L (2019a) Virtually trying on new clothing with arbitrary poses. In: Proceedings of the 27th ACM International Conference on Multimedia, ACM, pp 266–274
Acknowledgements
The authors sincerely thank Aruparna Maity and Dakshya Mishra for their help and thank Sankha Subhra Mullick for helpful discussion during this work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
In this appendix, we first discuss the Thin-Plate Spline (TPS) transformation. Second, we give a detailed study of the correlation layer used in the PGWM of LGVTON. Third, we present a detailed ablation study of the effectiveness of each component of LGVTON. Fourth, we provide the implementation details of our network architecture. In addition, we give some results of LGVTON on the DeepFashion [23] and MPV [7] datasets in Figs. 24, 25, 26, and 27.
1.1 Thin Plate Spline (TPS)
Given a set of pairs of source and target landmarks {(\(\mathbf {r}_j\), \(\mathbf {t}_j\)); j = 1, . . . , N}, a polyharmonic smoothing spline fits a function \(f(\cdot )\) between the sets {\(\mathbf {r}_j\); j = 1, . . . , N} and {\(\mathbf {t}_j\); j = 1, . . . , N} that minimizes the following objective functional [36]:
\[ \int _{\mathbb {R}^2} \left\Vert \nabla ^m f(\mathbf {x}) \right\Vert ^2 \, d\mathbf {x}, \quad \text {subject to } f(\mathbf {r}_j) = \mathbf {t}_j, \; j = 1, \ldots , N, \tag{7} \]
where \(\nabla ^m f\) is the vector of all m-th order partial derivatives of f, which enforces smoothness on f. In our case, \(\mathbf {r}_j, \mathbf {t}_j\in \mathbb {N}^{2}\); in other words, \(\mathbf {r}\equiv (r^x, r^y)\) and \(\mathbf {t}\equiv (t^x, t^y)\). The special case of the polyharmonic smoothing spline with m = 2, called the thin-plate spline (TPS), was proposed by Duchon [9]. It minimizes the following objective functional based on eq. (7):
\[ \iint _{\mathbb {R}^2} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\, dy, \quad \text {subject to } f(\mathbf {r}_j) = \mathbf {t}_j. \tag{8} \]
The physical analogy to the concept of the thin plate comes from the minimization of the second-order gradients (\(f_{xx}\), \(f_{xy}\), \(f_{yy}\)) in its objective, which restricts bending and enforces smoothness in the TPS fit. This is similar to the effect of physical rigidity in a thin metal plate.
A closed-form solution, as proposed in [42], is given by
\[ f(\mathbf {p}) = \mathbf {a}_0 + \mathbf {a}_1 p^x + \mathbf {a}_2 p^y + \sum _{j=1}^{N} \mathbf {c}_j\, \phi \left( \left\Vert \mathbf {p} - \mathbf {r}_j \right\Vert \right) , \tag{9} \]
where \(\mathbf {a}_0\), \(\mathbf {a}_1\), \(\mathbf {a}_2\), \(\mathbf {c}_j\) are parameters whose dimension equals that of the landmarks, which is 2 in our case.
The TPS is represented in terms of a radial basis function (RBF) \(\phi\). Given a set of control points \(\mathbf {r}_{j}\), \(j=1, \ldots , N\), an RBF maps a given point \(\mathbf {p}\) to a new location \(g(\mathbf {p})\):
\[ g(\mathbf {p}) = \sum _{j=1}^{N} \mathbf {c}_j\, \phi \left( \left\Vert \mathbf {p} - \mathbf {r}_j \right\Vert \right) . \tag{10} \]
The radial basis kernel used in TPS is \(\phi (r) = r^2 \ln r\).
When the values of \(\mathbf {t}_j\) are noisy (due to landmark localization errors), which is common in practice, the interpolation requirement is relaxed by means of regularization. This reduces the problem to an approximation problem [6, 39], solved by minimizing
\[ \sum _{j=1}^{N} \left\Vert \mathbf {t}_j - f(\mathbf {r}_j) \right\Vert ^2 + \lambda \iint _{\mathbb {R}^2} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\, dy, \tag{11} \]
where \(\lambda\) is the regularization parameter, a positive scalar that determines the relative weight between the approximation behavior and the smoothness of the transformation. In the limiting case \(\lambda = 0\), we get an interpolating transformation. A small value of \(\lambda\) gives a good approximation behavior; a larger value gives a very smooth transformation function, but then the local deformations determined by the set of landmarks are maintained poorly.
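As an illustration, the regularized TPS fit of eq. (11) can be sketched with SciPy's thin-plate-spline kernel, where the smoothing argument plays the role of \(\lambda\); this is a minimal sketch, and the landmark coordinates below are made-up values for illustration, not LGVTON outputs.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Source landmarks r_j and (noisy) target landmarks t_j; values are illustrative.
src = np.array([[10, 20], [80, 25], [45, 90], [15, 70], [85, 75]], dtype=float)
dst = src + np.array([[3, -2], [-4, 1], [2, 5], [1, 3], [-2, -1]], dtype=float)

# `smoothing` plays the role of lambda in eq. (11):
# 0 gives exact interpolation; larger values give smoother, looser fits.
tps = RBFInterpolator(src, dst, kernel='thin_plate_spline', smoothing=1.0)

# Evaluate the fitted mapping f on a dense pixel grid to warp an image.
h, w = 100, 100
grid = np.stack(np.meshgrid(np.arange(w), np.arange(h)), axis=-1)
warped_coords = tps(grid.reshape(-1, 2).astype(float)).reshape(h, w, 2)
```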
1.2 Correlation layer
Correlation is a statistical technique that shows whether, and how strongly, pairs of variables are related. We use this idea in our Pose Guided Warping Module (PGWM), where we model the warping of the source clothes as a function of the correlation between the human landmarks of the model and the person, and the fashion landmarks of the source clothes. The motivation for using correlation is that when a person wears clothes, the clothes get molded according to the person's body shape and pose. Since we transfer the clothes from the model to the person, the clothes have to deform according to how the body shape and pose change from the model to the person. This change is modeled by the correlation.
In more detail, given features \(f_a, f_b \in \mathbb {R}^{1\times l}\), the correlation layer produces the correlation map \(C_{ab} \in \mathbb {R}^{l \times l}\), where
\[ C_{ab}(i, j) = f_a(i)\, f_b(j), \quad i, j = 1, \ldots , l. \tag{12} \]
A correlation map therefore contains the pairwise similarities of all the values of \(f_a\) and \(f_b\). This is useful in establishing relationships among the different landmarks of the model and the person in order to model the warping.
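As a sketch, eq. (12) is simply an outer product of the two feature vectors; a minimal implementation follows (the framework choice is ours, not necessarily the paper's).

```python
import torch

def correlation_map(f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
    """Pairwise similarities C_ab(i, j) = f_a(i) * f_b(j), as in eq. (12).

    f_a, f_b: 1-D feature vectors of length l; returns an (l, l) map.
    """
    return torch.outer(f_a, f_b)

# Example with l = 4:
c = correlation_map(torch.rand(4), torch.rand(4))
print(c.shape)  # torch.Size([4, 4])
```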
1.3 Ablation study
In this subsection, we conduct an in-depth qualitative study on the significance of each component of LGVTON.
1.3.1 Utility of different loss functions for training ISM
We run a comparative study on the effect of the different loss functions used to train ISM. Keeping the other settings the same, we train 3 different instances of ISM with different combinations of loss functions, as shown in Fig. 19.
As a GAN tries to approximate the data distribution, it generates better output than its non-GAN variant [11]. This is evident in the results of the non-GAN variant of LGVTON, in which the ISM is trained with the DSSIM loss alone. We call this variant LGVTON (non-GAN, DSSIM loss) and the corresponding GAN variant LGVTON (cGAN, DSSIM loss). Comparing the results of LGVTON (cGAN, DSSIM loss) and LGVTON (ours) (a cGAN whose generator is trained with both DSSIM and perceptual losses), it can be observed that the results improve in the presence of the VGG perceptual loss.
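For concreteness, here is a hedged sketch of the non-adversarial part of these losses (DSSIM plus a VGG perceptual term), assuming PyTorch with the pytorch-msssim package; the VGG layer cut and the loss weight are illustrative assumptions, not our exact training configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights
from pytorch_msssim import ssim  # pip install pytorch-msssim

# Frozen VGG-19 features for the perceptual term; the cut at the first 16
# modules is an assumption for illustration.
vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def generator_loss(fake: torch.Tensor, real: torch.Tensor, w_perc: float = 1.0):
    # DSSIM = (1 - SSIM) / 2; a 3x3 window matches the kernel size noted in
    # the implementation details below.
    dssim = (1.0 - ssim(fake, real, data_range=1.0, win_size=3)) / 2.0
    perceptual = F.l1_loss(vgg(fake), vgg(real))
    return dssim + w_perc * perceptual
```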
1.3.2 Significance of fashion landmarks in PGWM
We conduct a study on PGWM where the warping is done with human landmarks only, instead of both human and fashion landmarks. Figure 20 shows two cases portraying the effectiveness of fashion landmarks around the collar and hem. Using only human landmarks might serve the purpose, but at the cost of an increase in the number of warping glitches, which is tackled to some extent in ISM with the help of the mask generated by the MGM. The scores reported in Tables 3 and 4 (LGVTON (w/o fashion landmarks)) show that without fashion landmarks the performance of LGVTON degrades. For obvious reasons, the performance degrades even more if the support of the MGM is also removed (observe the scores of LGVTON (w/o fashion landmarks, w/o MGM) in Tables 3 and 4).
1.3.3 Effectiveness of correlation layer in PGWM
We study the effectiveness of the correlation layer in the fashion landmark predictor network of PGWM (refer to Fig. 21). The presence of the correlation layer establishes the relationship between the human poses of the model and the person, which in turn assists in predicting better estimates of the fashion landmarks of the target warped clothes.
1.3.4 Studying the role of target mask as input to ISM
To understand the effect of the target mask, we train an instance of ISM without providing the target mask as input. We observe that, without the target mask, the network cannot identify the warping glitches; the variety of clothing types makes it difficult for the network to distinguish a warping glitch from the design of the clothes. This can be verified from Fig. 22h, where the effect of a warping glitch propagates to the output: in the first two rows, the inappropriate estimation of \(c'_{flm}\) causes the sleeves to be stretched outwards. When the target mask is given as input, ISM can identify the areas of warping glitches and take the necessary action according to the type of glitch. In Fig. 22j the stretching is not observed, as the network removes those areas of extra stretch and replaces them with the background. In the third-row example of Fig. 22h, a warping glitch exposes some body parts near the right neckline of the person (better viewed when zoomed in), while in Fig. 22j these get filled with the clothes' texture and color.
1.4 Implementation details
1.4.1 Pose Guided Warping Module (PGWM)
The dense neural network in the fashion landmark predictor network \(\mathcal {F}\) contains 6 consecutive dense layers with 900, 800, 600, 500, 250, and 100 nodes, respectively, each with a tanh activation function. The output layer has 12 nodes (for 6 fashion landmarks, each with 2 coordinate values x and y) and a sigmoid activation function. A minimal sketch is given below.
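The following sketch assumes PyTorch; the input width in_dim is a hypothetical placeholder, since only the hidden widths and the 12-d sigmoid output are specified above.

```python
import torch
import torch.nn as nn

class FashionLandmarkPredictor(nn.Module):
    """Sketch of the dense network in F; `in_dim` is a hypothetical input width."""

    def __init__(self, in_dim: int):
        super().__init__()
        widths = [900, 800, 600, 500, 250, 100]
        layers, prev = [], in_dim
        for w in widths:
            layers += [nn.Linear(prev, w), nn.Tanh()]
            prev = w
        layers += [nn.Linear(prev, 12), nn.Sigmoid()]  # 6 landmarks x (x, y)
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # normalized landmark coordinates in [0, 1]
```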
1.4.2 Mask Generator Module (MGM)
The architecture of MGM is that of an hourglass network [30]. Generally, the processing of clothes requires identifying their different parts to establish a semantic understanding of their structure. A well-known network that suits this requirement is the hourglass network [30].
An hourglass network is a CNN (Convolutional Neural Network) that captures features at various scales and is effective for analyzing spatial relationships among different parts of the input. Multiple such hourglass networks can be stacked together, with intermediate supervision, to make the model deeper. However, for our purpose, a stack size of 1 is found to be sufficient. An overview of the architecture of one hourglass network is given in Fig. 23.
The network is called an hourglass due to its top-down, bottom-up architecture, which matches the shape of an hourglass. It uses convolution and max-pooling layers to downscale features to a low resolution, after which a top-down sequence of upsampling begins; at each resolution, the corresponding features from the downsampling part are added back as skip connections. The network applies further convolutions to the features on the skip connections and then performs an element-wise addition of the two sets of features. A compact sketch is given below.
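Here is a compact, recursive sketch of one hourglass, assuming PyTorch; the channel width, recursion depth, and plain 3 \(\times\) 3 convolutions are illustrative assumptions, not the exact MGM configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """Sketch of a single hourglass (stack size 1); widths/depth are assumptions."""

    def __init__(self, depth: int = 4, ch: int = 64):
        super().__init__()
        self.pre = nn.Conv2d(ch, ch, 3, padding=1)    # before downsampling
        self.skip = nn.Conv2d(ch, ch, 3, padding=1)   # extra conv on the skip path
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else nn.Conv2d(ch, ch, 3, padding=1)
        self.post = nn.Conv2d(ch, ch, 3, padding=1)   # before upsampling

    def forward(self, x):
        skip = self.skip(x)                              # features kept at this resolution
        y = F.max_pool2d(self.pre(x), 2)                 # bottom-up: downscale
        y = self.inner(y)                                # recurse to lower resolutions
        y = F.interpolate(self.post(y), scale_factor=2)  # top-down: upsample
        return y + skip                                  # element-wise addition
```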
1.4.3 Image Synthesizer Module (ISM)
The hourglass network in G (the generator network of ISM) has a stack size of 1. The convolution layer generating \(I_m\) in G has a kernel of size 1 \(\times\) 1 with \(L_{1}\) regularization and a sigmoid activation function. For training ISM, we alternate between 3 steps of generator training and 1 step of discriminator training. It is trained with the same settings of Adam as our PGWM. For the DSSIM loss, the kernel size taken is 3 \(\times\) 3.

The discriminator D is a patchGAN discriminator [17]. Instead of classifying the whole image as real or fake, it classifies each patch of the image, where the patch size is much smaller than the input image size. Hence, pixels separated by more than a patch diameter are modeled independently. This makes it work like a texture/style loss, as discussed in [17], which helps preserve texture in the final output image. Existing works have shown the efficacy of patchGAN [17, 47] in image-based problems. A minimal sketch of such a discriminator is given at the end of this subsection.

For human parsing, we used the human parsing network proposed by [10], pretrained on the LIP dataset [10]. The dataset contains 19 part labels: 6 labels for body parts and 13 for clothing categories.

During our quantitative analysis, we used the models of CP-VTON, VTNFP, and MG-VTON pretrained on the MPV dataset [7]. For VITON, we used the model weights, pretrained on the VITON dataset, provided in its official implementation. Some more results of this work on the DeepFashion and MPV datasets are shown in Figs. 24, 25, 26, and 27.
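Below is a minimal patchGAN discriminator sketch in the spirit of [17], assuming PyTorch; the channel widths and depth are our assumptions, not the exact configuration of D.

```python
import torch.nn as nn

def patchgan_discriminator(in_ch: int) -> nn.Sequential:
    """Sketch of a patch-level real/fake classifier; widths are assumptions."""
    def block(ci, co, stride):
        return [nn.Conv2d(ci, co, 4, stride, 1), nn.InstanceNorm2d(co), nn.LeakyReLU(0.2)]
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),
        *block(64, 128, 2),
        *block(128, 256, 2),
        *block(256, 512, 1),
        nn.Conv2d(512, 1, 4, 1, 1),  # one real/fake logit per receptive-field patch
    )
```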
About this article
Cite this article
Roy, D., Santra, S. & Chanda, B. LGVTON: a landmark guided approach for model to person virtual try-on. Multimed Tools Appl 81, 5051–5087 (2022). https://doi.org/10.1007/s11042-021-11647-9