LGVTON: a landmark guided approach for model to person virtual try-on

  • 1193: Intelligent Processing of Multimedia Signals
  • Published in: Multimedia Tools and Applications

Abstract

In this paper, we propose a Landmark Guided Virtual Try-On (LGVTON) method for clothes, which aims to solve the problem of clothing trials on e-commerce websites. Given the images of two people, a person and a model, it generates a rendition of the person wearing the clothes of the model. This is useful because on most e-commerce websites images of the clothes alone are usually not available. We follow a three-stage approach to achieve our objective. In the first stage, LGVTON warps the clothes of the model using a Thin-Plate Spline (TPS) based transformation to fit the person. Unlike previous TPS-based methods, we use the landmarks (of human and clothes) to compute the TPS transformation. This enables the warping to work independently of the complex patterns, such as stripes, florals, and textures, present on the clothes. However, this computed warp may not always be precise. We therefore refine it further in the subsequent stages with the help of a mask generator (Stage 2) and an image synthesizer (Stage 3) module. The mask generator improves the fit of the warped clothes, and the image synthesizer ensures a realistic output. To tackle the lack of paired training data, we resort to a self-supervised training strategy; here, paired data refers to an image pair of a model and a person wearing the same clothes. We compare LGVTON with four existing methods on two popular fashion datasets, namely MPV and DeepFashion, using two performance measures, FID (Fréchet Inception Distance) and SSIM (Structural Similarity Index). The proposed method outperforms the state-of-the-art methods in most cases.


Data Availability Statement

The datasets used in this work are publicly available. The sources are appropriately cited.

Code availability

The code will be made available once the work is published.

Notes

  1. This is elaborated further in the ablation study given in the Appendix.

  2. A more detailed discussion of the correlation layer is given in the Appendix.

  3. A detailed study of TPS is given in the Appendix.

  4. A detailed ablation study of MGM is given in the Appendix.

  5. This is detailed further in the Appendix.

  6. More results of LGVTON are given in the Appendix.

  7. Please see the Appendix for a detailed study on the utility of each component of LGVTON.

References

  1. Alp Güler R, Neverova N, Kokkinos I (2018) Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7297–7306

  2. Belongie S, Malik J, Puzicha J (2001) Shape context: A new descriptor for shape matching and object recognition. In: Advances in neural information processing systems, pp 831–837

  3. Bogo F, Kanazawa A, Lassner C, Gehler P, Romero J, Black MJ (2016) Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In: European Conference on Computer Vision, Springer, pp 561–578

  4. Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  5. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09

  6. Donato G, Belongie S (2002) Approximate thin plate spline mappings. In: European conference on computer vision, Springer, pp 21–31

  7. Dong H, Liang X, Shen X, Wang B, Lai H, Zhu J, Hu Z, Yin J (2019a) Towards multi-pose guided virtual try-on network. In: Proceedings of the IEEE International Conference on Computer Vision, pp 9026–9035

  8. Dong H, Liang X, Shen X, Wu B, Chen BC, Yin J (2019b) Fw-gan: Flow-navigated warping gan for video virtual try-on. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1161–1170

  9. Duchon J (1977) Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In: Constructive theory of functions of several variables, Springer, pp 85–100

  10. Gong K, Liang X, Zhang D, Shen X, Lin L (2017) Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 932–940

  11. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680

  12. Han X, Wu Z, Wu Z, Yu R, Davis LS (2018) Viton: An image-based virtual try-on network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7543–7552

  13. Han X, Hu X, Huang W, Scott MR (2019) Clothflow: A flow-based model for clothed person generation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 10471–10480

  14. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems, pp 6626–6637

  15. Hsieh CW, Chen CY, Chou CL, Shuai HH, Cheng WH (2019a) Fit-me: Image-based virtual try-on with arbitrary poses. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, pp 4694–4698

  16. Hsieh CW, Chen CY, Chou CL, Shuai HH, Liu J, Cheng WH (2019b) Fashionon: Semantic-guided image-based virtual try-on with detailed human and clothing information. In: Proceedings of the 27th ACM International Conference on Multimedia, pp 275–283

  17. Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1125–1134

  18. Issenhuth T, Mary J, Calauzènes C (2019) End-to-end learning of geometric deformations of feature maps for virtual try-on. arXiv preprint arXiv:1906.01347

  19. Jae Lee H, Lee R, Kang M, Cho M, Park G (2019) La-viton: A network for looking-attractive virtual try-on. In: Proceedings of the IEEE International Conference on Computer Vision Workshops

  20. Jandial S, Chopra A, Ayush K, Hemani M, Krishnamurthy B, Halwai A (2020) Sievenet: A unified framework for robust image-based virtual try-on. In: The IEEE Winter Conference on Applications of Computer Vision, pp 2182–2190

  21. Jetchev N, Bergmann U (2017) The conditional analogy gan: Swapping fashion articles on people images. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2287–2292

  22. Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision, Springer, pp 694–711

  23. Liu Z, Luo P, Qiu S, Wang X, Tang X (2016a) Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104

  24. Liu Z, Yan S, Luo P, Wang X, Tang X (2016b) Fashion landmark detection in the wild. In: European Conference on Computer Vision, Springer, pp 229–245

  25. Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: A review. Expert Systems with Applications 91:480–491

  26. Maintz JA, Viergever MA (1998) A survey of medical image registration. Medical image analysis 2(1):1–36

  27. Minar MR, Ahn H (2020) Cloth-vton: Clothing three-dimensional reconstruction for hybrid image-based virtual try-on. In: Proceedings of the Asian Conference on Computer Vision

  28. Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784

  29. Neuberger A, Borenstein E, Hilleli B, Oks E, Alpert S (2020) Image based virtual try-on network from unpaired data. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  30. Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. In: European conference on computer vision, Springer, pp 483–499

  31. Pons-Moll G, Pujades S, Hu S, Black MJ (2017) Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG) 36(4):73

  32. Raj A, Sangkloy P, Chang H, Hays J, Ceylan D, Lu J (2018) Swapnet: Image based garment transfer. In: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XII, pp 679–695

  33. Rocco I, Arandjelovic R, Sivic J (2017) Convolutional neural network architecture for geometric matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6148–6157

  34. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. In: Advances in neural information processing systems, pp 2234–2242

  35. Sekine M, Sugita K, Perbet F, Stenger B, Nishiyama M (2014) Virtual fitting by single-shot body shape estimation. In: Int. Conf. on 3D Body Scanning Technologies, Citeseer, pp 406–413

  36. Tirosh S, Van De Ville D, Unser M (2006) Polyharmonic smoothing splines and the multidimensional Wiener filtering of fractal-like signals. IEEE Transactions on Image Processing

  37. Shigeki Y, Okura F, Mitsugami I, Yagi Y (2018) Estimating 3d human shape under clothing from a single rgb image. IPSJ Transactions on Computer Vision and Applications 10(1):16

  38. Song D, Li T, Mao Z, Liu AA (2019) Sp-viton: shape-preserving image-based virtual try-on network. Multimedia Tools and Applications pp 1–13

  39. Sprengel R, Rohr K, Stiehl HS (1996) Thin-plate spline approximation for image registration. In: Proceedings of 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, vol 3, pp 1190–1191

  40. Sun F, Guo J, Su Z, Gao C (2019) Image-based virtual try-on network with structural coherence. In: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, pp 519–523

  41. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826

  42. Wahba G (1990) Spline models for observational data, vol 59. SIAM

  43. Wang B, Zheng H, Liang X, Chen Y, Lin L, Yang M (2018a) Toward characteristic-preserving image-based virtual try-on network. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 589–604

  44. Wang W, Xu Y, Shen J, Zhu SC (2018b) Attentive fashion grammar network for fashion landmark detection and clothing category classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4271–4280

  45. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4):600–612

  46. Wu Z, Lin G, Tao Q, Cai J (2019) M2e-try on net: Fashion from model to everyone. In: Proceedings of the 27th ACM International Conference on Multimedia, ACM, pp 293–301

  47. Xian W, Sangkloy P, Agrawal V, Raj A, Lu J, Fang C, Yu F, Hays J (2018) Texturegan: Controlling deep image synthesis with texture patches. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8456–8465

  48. Yang H, Zhang R, Guo X, Liu W, Zuo W, Luo P (2020) Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  49. Yu R, Wang X, Xie X (2019) Vtnfp: An image-based virtual try-on network with body and clothing feature preservation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 10511–10520

  50. Zanfir M, Popa AI, Zanfir A, Sminchisescu C (2018) Human appearance transfer. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  51. Zeng W, Zhao M, Gao Y, Zhang Z (2020) Tilegan: category-oriented attention-based high-quality tiled clothes generation from dressed person. Neural Computing and Applications

  52. Zhao R, Ouyang W, Wang X (2013) Unsupervised salience learning for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3586–3593

  53. Zheng N, Song X, Chen Z, Hu L, Cao D, Nie L (2019a) Virtually trying on new clothing with arbitrary poses. In: Proceedings of the 27th ACM International Conference on Multimedia, ACM, pp 266–274

  54. Zheng N, Song X, Chen Z, Hu L, Cao D, Nie L (2019b) Virtually trying on new clothing with arbitrary poses. In: Proceedings of the 27th ACM International Conference on Multimedia, pp 266–274

Download references

Acknowledgements

The authors sincerely thank Aruparna Maity and Dakshya Mishra for their help and thank Sankha Subhra Mullick for helpful discussion during this work.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix

In this appendix, first, we discuss the thin-plate spline (TPS) transformation. Second, we give a detailed study of the correlation layer used in the PGWM of LGVTON. Third, we present a detailed ablation study of the effectiveness of each component of LGVTON. Fourth, we provide the implementation details of our network architecture. In addition, we give some results of LGVTON on the DeepFashion [23] and MPV [7] datasets in Figs. 24, 25, 26, and 27.

1.1 Thin Plate Spline (TPS)

Given a set of pairs of source and target landmarks {(\(\mathbf {r}_j\), \(\mathbf {t}_j\)); j = 1, . . . , N}, a polyharmonic smoothing spline fits a function \(f(\cdot )\) between the sets {\(\mathbf {r}_j\); j = 1, . . . , N} and {\(\mathbf {t}_j\); j = 1, . . . , N} that minimizes the following objective functional [36]:

$$\begin{aligned} \tau [f] = \sum _{j=1}^N \Vert f(\mathbf {r}_j)-\mathbf {t}_j\Vert _2^2 + \iint \limits _{\mathbb {R}^2} \Vert \nabla ^m f\Vert _2^2dx\ dy \end{aligned}$$
(7)

where \(\nabla ^m f\) is the vector of all m-th order partial derivatives of f, which enforces smoothness on f. In our case, \(\mathbf {r}_j, \mathbf {t}_j\in \mathbb {N}^{2}\), or in other words, \(\mathbf {r}\equiv (r^x, r^y)\) and \(\mathbf {t}\equiv (t^x, t^y)\). A special case of the polyharmonic smoothing spline with m = 2, called the thin-plate spline (TPS), was proposed by Duchon [9]. It minimizes the following objective functional, a specialization of eq. (7):

$$\begin{aligned} \tau [f] = \sum _{j=1}^N \Vert f(\mathbf {r}_j)-\mathbf {t}_j\Vert _2^2 + \iint \limits _{\mathbb {R}^2} [f_{xx}^2 + 2f_{xy}^2 + f_{yy}^2]dx\ dy \end{aligned}$$
(8)

The physical analogy to the concept of the thin plate comes from the minimization of the second-order derivatives (\(f_{xx}\), \(f_{xy}\), \(f_{yy}\)) in its objective, which restricts bending and enforces smoothness in the TPS fit. This is similar to the effect of physical rigidity in a thin metal plate.

A closed-form solution, as given in [42], is

$$\begin{aligned} \mathbf {t} = f(\mathbf {r}) = \mathbf {a}_0 + \mathbf {a}_1 r^x + \mathbf {a}_2 r^y + \sum _{j=1}^N \mathbf {c}_j\phi (\Vert \mathbf {r} - \mathbf {r}_j\Vert _2), \end{aligned}$$
(9)

where \(\mathbf {a}_0\), \(\mathbf {a}_1\), \(\mathbf {a}_2\), \(\mathbf {c}_j\) are parameters with dimension equal to the dimension of landmarks, which is 2 in our case.

The TPS is represented in terms of a radial basis function (RBF) \(\phi\). Given a set of control points \(\mathbf {r}_{j}\), \(j=1, ..., N\), an RBF maps a given point \(\mathbf {p}\) to a new location \(g(\mathbf {p})\), represented by

$$\begin{aligned} g(\mathbf {p}) = \sum _{j=1}^N \mathbf {c}_j\phi (\Vert \mathbf {p} - \mathbf {r}_j\Vert _2) \end{aligned}$$
(10)

The radial basis kernel used in TPS is \(\phi (d)\) = \(d^2 \ln d\), where \(d = \Vert \mathbf {p} - \mathbf {r}_j\Vert _2\) is the (scalar) distance to a control point.

When the values of \(\mathbf {t}_j\) are noisy (due to landmark localization errors), which is common in practice, the interpolation requirement is relaxed by means of regularization. This reduces the problem to an approximation problem [6, 39], obtained by minimizing

$$\begin{aligned} H[f] = \sum _{j=1}^N \Vert f(\mathbf {r}_j)-\mathbf {t}_j\Vert _2^2 + \lambda \iint \limits _{\mathbb {R}^2} [f_{xx}^2 + 2f_{xy}^2 + f_{yy}^2]dx\ dy, \end{aligned}$$
(11)

where \(\lambda\) is the regularization parameter, a positive scalar. Here, \(\lambda\) determines the relative weight between the approximation behavior and the smoothness of the transformation. In the limiting case \(\lambda\) = 0, we get an interpolating transformation. For small values of \(\lambda\) we get good approximation behavior; for larger values of \(\lambda\) we get a very smooth transformation function, but the local deformations determined by the set of landmarks are then preserved poorly.
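For concreteness, a minimal NumPy sketch of fitting and applying a 2-D TPS of the form in eq. (9), with the regularization parameter \(\lambda\) of eq. (11), is given below. It solves the standard (N + 3) \(\times\) (N + 3) linear system; it is an illustrative implementation under simplifying assumptions, not necessarily the code used in LGVTON.

```python
import numpy as np

def tps_kernel(d):
    """TPS radial basis phi(d) = d^2 ln d, with phi(0) defined as 0."""
    return np.where(d > 0, d**2 * np.log(np.maximum(d, 1e-12)), 0.0)

def fit_tps(src, dst, lam=0.0):
    """Fit a 2-D TPS mapping src landmarks (N, 2) to dst landmarks (N, 2).

    lam = 0 gives the interpolating spline of eq. (8); lam > 0 gives the
    regularized approximation of eq. (11).
    """
    n = src.shape[0]
    d = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    K = tps_kernel(d) + lam * np.eye(n)        # (N, N) kernel matrix
    P = np.hstack([np.ones((n, 1)), src])       # (N, 3): [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    params = np.linalg.solve(A, b)
    return params[:n], params[n:]               # c_j (N, 2) and a_0, a_1, a_2 (3, 2)

def apply_tps(points, src, c, a):
    """Warp arbitrary points (M, 2) with the fitted spline, as in eq. (9)."""
    d = np.linalg.norm(points[:, None, :] - src[None, :, :], axis=-1)
    return a[0] + points @ a[1:] + tps_kernel(d) @ c
```

In our setting, the source and target point sets would correspond to the landmarks of the model's clothes and their predicted locations on the person; in the sketch, `src` and `dst` are generic placeholders.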

1.2 Correlation layer

Correlation is a statistical technique that shows whether and how strongly pairs of variables are related. We use the idea of correlation in our Pose Guided Warping Module (PGWM). Here we model the warping of the source clothes as a function of the correlation between the human landmarks of the model and the person, and the fashion landmarks of the source clothes. The reason for using correlation is that when a person wears clothes, the clothes get molded according to the person's body shape and pose. Since we are transferring the clothes from the model to the person, the clothes have to undergo deformation according to how the body shape and pose change from model to person. This change is modeled by the correlation.

In more detail, given the data \(f_a, f_b \in \mathbb {R}^{1\times l}\), the correlation layer produces the correlation map \(C_{ab} \in \mathbb {R}^{l \times l}\), where

$$\begin{aligned} C_{ab} = {f_a}^{T}{f_b}. \end{aligned}$$
(12)

Therefore, a correlation map contains the pairwise similarities between all the values of \(f_a\) and \(f_b\). This is useful in establishing relationships among the different landmarks of the model and the person in order to model the warping.
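As a small illustration (not the exact layer used in PGWM), the map of eq. (12) reduces to an outer product of the two flattened feature vectors; the landmark counts below are hypothetical placeholders.

```python
import numpy as np

def correlation_map(f_a, f_b):
    """Correlation map C_ab = f_a^T f_b of eq. (12).

    f_a, f_b: row vectors of shape (1, l), e.g. flattened landmark coordinates;
    the (i, j) entry of the result relates the i-th value of f_a to the j-th of f_b.
    """
    return f_a.T @ f_b   # (l, 1) @ (1, l) -> (l, l)

# Hypothetical usage: l = 2 * number of landmarks when (x, y) pairs are flattened.
f_model  = np.random.rand(1, 36)          # e.g. 18 human landmarks of the model
f_person = np.random.rand(1, 36)          # e.g. 18 human landmarks of the person
C = correlation_map(f_model, f_person)    # (36, 36) correlation map
```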

1.3 Ablation study

In this subsection, we conduct an in-depth qualitative study on the significance of each component of LGVTON.

1.3.1 Utility of different loss functions for training ISM

We run a comparative study on the effect of the different loss functions used to train the ISM. Keeping the other settings the same, we train three instances of ISM with different combinations of loss functions, as shown in Fig. 19.

Fig. 19

Effectiveness of different losses during training of the Image Synthesizer Module (ISM). (1) LGVTON (non-GAN, DSSIM loss) - a non-GAN variant of ISM trained with DSSIM loss, (2) LGVTON (cGAN, DSSIM loss) - cGAN with the generator trained with DSSIM loss, (3) LGVTON - similar to (2) but the generator is trained with both DSSIM and VGG perceptual losses. Notice that the clarity of the output increases from LGVTON (non-GAN, DSSIM loss) to LGVTON

Since a GAN tries to approximate the data distribution, it generates better outputs than its non-GAN variant [11]. This is evident in the results of the non-GAN variant of LGVTON, where the ISM is trained with DSSIM loss. We call this variant LGVTON (non-GAN, DSSIM loss) and the corresponding GAN variant LGVTON (cGAN, DSSIM loss). Comparing the results of LGVTON (cGAN, DSSIM loss) and LGVTON (ours), a cGAN where the generator is trained with both DSSIM and perceptual losses, it can be observed that the results improve in the presence of the VGG perceptual loss.

1.3.2 Significance of fashion landmarks in PGWM

We conduct a study on PGWM when the warping is done with human landmarks only instead of both human and fashion landmarks. Figure 20 shows two cases portraying the effectiveness of fashion landmarks around the collar and hem. Using only human landmarks might serve the purpose, but at the cost of more warping glitches, which are tackled to some extent in the ISM with the help of the mask generated by the MGM. The scores reported in Tables 3 and 4 (LGVTON (w/o fashion landmarks)) show that without fashion landmarks the performance of LGVTON degrades. However, for obvious reasons, the performance degrades even more if the support of the MGM is also removed (observe the score of LGVTON (w/o fashion landmarks, w/o MGM) in Tables 3 and 4).

Fig. 20

Role of different fashion landmarks in predicting the target warp of the model's clothes in PGWM. The figure shows the utility of fashion landmarks around the collar (left) and hem (right), respectively. For better readability, instead of showing only the generated warped clothes image, we show its overlay on the person image. (1) Person, (2) model, (3) clothes warped using only human landmarks, (4) transformed locations of the fashion landmarks of the model's clothes, obtained by the corresponding transformation function of (3), (5) clothes warped by PGWM using both human and fashion landmarks, (6) fashion landmarks predicted in PGWM, which are observed to be more accurate than (4). This results in better warping of the model's clothes, as shown in (5) in comparison to (3)

1.3.3 Effectiveness of correlation layer in PGWM

We study the effectiveness of the correlation layer in the fashion landmark predictor network of PGWM (refer to Fig. 21). The presence of the correlation layer establishes the relationship between the human poses of the model and the person, which in turn helps in predicting better estimates of the fashion landmarks of the target warped clothes.

Fig. 21

A study on the effectiveness of the correlation layer in the fashion landmark predictor network \(\mathcal {F}\) of PGWM in LGVTON. (a) Person image, (b) model image, (c, e) predicted locations of fashion landmarks and warped clothes generated by PGWM when \(\mathcal {F}\) is trained without and with the correlation layer, respectively. (d, f) Final VTON results for the warped clothes shown in (c) and (e), respectively. We observe that the results in (f) are better than those in (d), which justifies the potency of the correlation layer in PGWM

1.3.4 Studying the role of target mask as input to ISM

To understand the effect of the target mask, we train an instance of ISM without providing the target mask as input. We observe that without the target mask, the network cannot identify the warping glitches. This can be verified from Fig. 22h, where the effect of the warping glitch is propagated to the output. The reason is that the variety of clothing types makes it hard for the network to distinguish a warping glitch from the design of the clothes. However, when a target mask is given as input, ISM can identify the areas of warping glitch and take the necessary action according to the type of the glitch. For example, in the first two rows, the inappropriate estimation of \(c'_{flm}\) causes the sleeves to be stretched too far outwards; this is not observed in Fig. 22j because the network removes those areas of extra stretch and replaces them with the background. In the example of the third row, Fig. 22h, the effect of the warping glitch exposes some body parts near the right neckline of the person (better viewed when zoomed in), while this area gets filled with the clothes' texture and color in Fig. 22j.

Fig. 22

Effectiveness of the target mask generated by our mask generator module (MGM). We notice significant differences (shown in (f), (g)) between the mask (d) of the predicted warped clothes (c) and the mask (e) predicted by MGM corresponding to the warped clothes (c). (h) shows that the final try-on result contains artifacts when ISM is trained without the target mask. To show the effectiveness of the target mask, we also show the result where (d) instead of (e) is given as the target mask input to ISM; this is shown in (i). As we observe, the artifacts still remain in (i). Whereas, when (e) is given to ISM as the target mask, the result (j) improves; e.g., the areas with pixel value 1 in (f) are filled with the necessary color and texture details, and those in (g) are replaced with background details, resulting in a better VTON result. Note that the hole in the warped clothes in row 2 is not due to a warping glitch; it is due to inaccurate human parsing. However, LGVTON handles this as well

1.4 Implementation details

1.4.1 Pose Guided Warping Module (PGWM)

The dense neural network in the fashion landmark predictor network \(\mathcal {F}\) contains 6 consecutive dense layers with 900, 800, 600, 500, 250, and 100 nodes respectively, each with tanh activation. Finally, the output layer has 12 nodes (for 6 fashion landmarks, each with 2 coordinate values, x and y) and sigmoid activation.
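A minimal PyTorch sketch of this dense network is given below; the framework and the input dimensionality (`in_features`) are not specified at this point in the text and are therefore assumptions.

```python
import torch.nn as nn

def landmark_predictor(in_features: int) -> nn.Sequential:
    """Sketch of the dense fashion landmark predictor described above."""
    widths = [900, 800, 600, 500, 250, 100]
    layers, prev = [], in_features
    for w in widths:
        layers += [nn.Linear(prev, w), nn.Tanh()]   # dense layer with tanh activation
        prev = w
    layers += [nn.Linear(prev, 12), nn.Sigmoid()]    # 6 landmarks x (x, y), values in [0, 1]
    return nn.Sequential(*layers)
```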

1.4.2 Mask Generator Module (MGM)

The architecture of MGM is that of an hourglass network [30]. In general, processing clothes requires identifying their different parts in order to establish a semantic understanding of their structure. A well-known network that suits this requirement is the hourglass network [30].

An hourglass network is a CNN (Convolutional Neural Network) that captures features at various scales and is effective for analyzing spatial relationships among different parts of the input. Multiple hourglass networks can be stacked together, with intermediate supervision, to make the model deeper. However, for our purpose, a stack size of 1 is found to be sufficient. An overview of the architecture of one hourglass network is given in Fig. 23.

Fig. 23

Overview of an hourglass network (taken from the original paper [30]), where each block represents a residual module

The network is called an hourglass due to its bottom-up, top-down architecture, which matches the shape of an hourglass. It uses convolution and max pooling layers to downscale the features to a low resolution. After that, the top-down sequence of upsampling begins, and at each resolution the corresponding features from the downsampling path are added through skip connections. However, additional convolutions are applied to the skip-connection features before the element-wise addition of the two sets of features.
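The following is a much simplified PyTorch sketch of one hourglass block reflecting the description above; the residual modules of [30] are replaced by single convolutions here, so it is only illustrative and not the exact MGM architecture.

```python
import torch.nn as nn

class Hourglass(nn.Module):
    """Simplified single hourglass: recursive downsample/upsample with skip additions.

    Input height and width are assumed to be divisible by 2**depth.
    """
    def __init__(self, channels: int, depth: int = 4):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 3, padding=1)   # extra conv on the skip path
        self.down = nn.Conv2d(channels, channels, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.inner = (Hourglass(channels, depth - 1) if depth > 1
                      else nn.Conv2d(channels, channels, 3, padding=1))
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        skip = self.skip(x)                         # features kept at this resolution
        y = self.inner(self.pool(self.down(x)))     # bottom-up: pool to a lower resolution
        return skip + self.up(y)                    # top-down: upsample and add element-wise
```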

1.4.3 Image Synthesizer Module (ISM)

The hourglass network in G (the generator network of ISM) has a stack size of 1. The convolution layer generating \(I_m\) in G has a kernel of size 1 \(\times\) 1 with \(L_{1}\) regularization and a sigmoid activation function. For training ISM, we alternate between 3 steps of generator training and 1 step of discriminator training. It is trained with the same Adam settings as our PGWM. For the DSSIM loss, the kernel size is 3 \(\times\) 3. The discriminator D is a patchGAN discriminator [17]. Instead of classifying the whole image as real or fake, it classifies each patch of the image, where the patch size is much smaller than the input image size. Hence, pixels separated by more than a patch diameter are modeled independently. This makes it work like a texture/style loss, as discussed in [17], which helps to keep better texture in the final output image. Existing works have shown the efficacy of patchGAN [17, 47] in image-based problems. For human parsing, we used the human parsing network proposed by [10], pretrained on the LIP dataset [10]. The dataset contains 19 part labels: 6 for body parts and 13 for clothing categories. During our quantitative analysis, we used the models of CP-VTON, VTNFP, and MG-VTON pretrained on the MPV [7] dataset. For VITON, we used the model weights pretrained on the VITON dataset provided in its official implementation. Some more results of this work on the DeepFashion and MPV datasets are shown in Figs. 24, 25, 26, and 27.
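A minimal sketch of such a DSSIM loss (assuming a uniform 3 \(\times\) 3 averaging window and images normalized to [0, 1]) is given below for illustration; it is not necessarily identical to our implementation.

```python
import torch.nn.functional as F

def dssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """DSSIM = (1 - SSIM) / 2, computed with a 3x3 averaging window.

    x, y: image batches of shape (B, C, H, W), values assumed in [0, 1].
    """
    mean = lambda t: F.avg_pool2d(t, kernel_size=3, stride=1, padding=1)
    mu_x, mu_y = mean(x), mean(y)
    var_x = mean(x * x) - mu_x ** 2          # local variances
    var_y = mean(y * y) - mu_y ** 2
    cov_xy = mean(x * y) - mu_x * mu_y       # local covariance
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ((1.0 - ssim) / 2.0).mean()
```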

Fig. 24

Our results on the DeepFashion [23] dataset. The image at position (i, j) represents the VTON result when the person in the \(j^{th}\) column wears the clothes of the model in the \(i^{th}\) row

Fig. 25

Our results on the MPV [7] dataset. The image at position (i, j) represents the VTON result when the person in the \(j^{th}\) column wears the clothes of the model in the \(i^{th}\) row

Fig. 26

Our results on MPV dataset for different model and person combinations

Fig. 27

Our results on MPV dataset for different model and person combinations

About this article

Cite this article

Roy, D., Santra, S. & Chanda, B. LGVTON: a landmark guided approach for model to person virtual try-on. Multimed Tools Appl 81, 5051–5087 (2022). https://doi.org/10.1007/s11042-021-11647-9

