
1 Introduction

Online fashion sales are already worth billions of dollars and keep growing rapidly. An ultra-connected millennial generation spends large amounts of time on social networks such as Instagram, and the quest for fashion inspiration is being deeply transformed as a result. Instead of browsing shop displays, these users take screenshots of images featuring their fashion favourites and expect to be able to search for them on their smartphone. Hence, “street-to-shop”, the task of retrieving products that are similar to garments depicted in a “wild” image of varying quality and non-uniform background, is a critical capability for online stores. More and more of them now offer visual search features, Asos in the UK being one of the latest examples [29].

The surge of computational fashion is reflected in a growing number of academic contributions to the topic, often in partnership with online shops. Much effort focuses on representation learning for apparel [1, 3, 11, 28], where fashion items are encoded as feature vectors in a high-dimensional space that is implicitly structured by their properties. Much work also goes into building taggers, models that predict attributes of products such as yellow, v-neck and long sleeves [2, 4, 5]. Another line of work is clothing parsing [7, 31, 34, 35], which consists of locating the various apparel categories in an image, pixel-wise. Magic mirrors are also popular [6, 13, 18, 23, 30, 36], and fashion synthesis is gaining attention [14, 40]. Large companies also contribute, for example Pinterest [16] and eBay [37], even if the actual architectures or datasets are not shared.

Fig. 1. Typical images from our catalogue for a black t-shirt.

Research on the street-to-shop task started with [24] and has since grown into a large body of literature. Early studies [8, 17, 24, 33, 34] rely on classical computer vision and use tools such as body-part detection and hand-crafted features. In recent years, however, deep learning has become the norm. Recent studies typically share a common principle: feature representations are learnt for both query images and products via attribute classification, and a ranking loss is then employed that encourages matching pairs to have higher scores [12, 15, 25, 27, 28, 32]. Often, Siamese architectures are used to relate the two domains [12, 32], but some studies make no dichotomy between query and product images and employ a single branch [25].

As the leading European online fashion platform, which has been active for ten years, Zalando has accumulated a huge catalogue containing millions of items, each of which is described by a set of studio images (Fig. 1) and a set of attributes. This data is leveraged throughout the company to provide internal and customer-facing services. At Zalando Research we continuously develop the representation learning framework FashionDNA [3, 11]. It serves both as a practical numerical interface for visual information and as a research tool for fashion computation. In particular, it provides a compact product representation in our visual search model Studio2Shop [20]. Studio2Shop matches studio model images (with uniform background, as in Figs. 1b–d) with fashion articles. It works well, but the restriction to studio model images limits its practical application, and we aim to extend it to real-life (wild) images with varied backgrounds and illumination.

To extend Studio2Shop to a true street-to-shop model, we can think of two obvious approaches:

  • replace the studio images with “wild” image data and retrain, or

  • build a pipeline that first performs image segmentation, i.e. separates the background from the person shown, and then matches the segmented image to fashion products.

While the first approach is direct, more elegant, and likely faster at test time, obtaining training data with appropriate annotations in sufficient numbers and quality is a time-consuming and costly endeavour, as there is a shortage of publicly available data, and images would have to be annotated manually to retain the quality of studio image annotations. The second approach allows us to retain our large catalogue dataset as the primary source to learn image–product matching. Moreover, a body of ground-truth data for the segmentation model is available for academic purposes, and the image segmentation task can easily be crowd-sourced for commercial purposes. Recent models [9, 26] are mature enough that working techniques can be adapted quickly.

Here, we follow the second approach and show that it is promising. Our main contribution is three-fold:

  • we extend Studio2Shop to wild images,

  • we retain all the former contributions of Studio2Shop: (a) it requires no user input, (b) it naturally handles all fashion categories, and (c) it shows that static product representations are effective, and

  • we achieve reasonable results on external datasets without fine-tuning our model.

We now proceed with an overview of our model pipeline (Sect. 2) and describe its components in more detail in Sects. 3–5. Section 6 shows our results on external datasets. Finally, Sect. 7 concludes the study.

Fig. 2. The pipeline of Street2Fashion2Shop. The query image (top row) is segmented by Street2Fashion, while FashionDNA is run on the title images of the products in the assortment (bottom row) to obtain static feature vectors. The result of these two operations forms the input of Fashion2Shop, which handles the product matching.

2 Street2Fashion2Shop: The Pipeline

Figure 2 shows the working pipeline of Street2Fashion2Shop. On the top row, the wild query image is segmented by Street2Fashion in order to discard the background and keep the fashion only. On the bottom row, our product representation FashionDNA is applied to the title images of the set of products we should retrieve from (the assortment), in order to obtain static feature vectors. The segmented query image and the product feature vectors (the fDNAs) then form the input of the product-matching model Fashion2Shop, which gives a score to each product conditioned on the query image. Finally, products are ranked by relevance (decreasing score) and the top-ranked ones are displayed. A minimal sketch of this flow is given below.
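As an illustration of the flow only, the sketch below uses hypothetical placeholder functions for the three components described in Sects. 3–5; it is not the actual implementation.

```python
import numpy as np

def street2fashion(image):
    """Segmentation (Sect. 4): return the image with the background replaced by white."""
    raise NotImplementedError  # placeholder for the U-net-style model of Sect. 4

def fashion_dna(title_images):
    """Product encoder (Sect. 3): return one static 128-d fDNA vector per product."""
    raise NotImplementedError  # placeholder for the residual network of Sect. 3

def fashion2shop(segmented_query, product_fdnas):
    """Matching model (Sect. 5): return one score per product for this query."""
    raise NotImplementedError  # placeholder for the two-leg network of Sect. 5

def retrieve(query_image, title_images, top_k=50):
    fdnas = fashion_dna(title_images)        # computed once, offline, per assortment
    segmented = street2fashion(query_image)  # computed per query, at test time
    scores = fashion2shop(segmented, fdnas)
    return np.argsort(-np.asarray(scores))[:top_k]  # rank by decreasing score
```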

3 FashionDNA 1.1 (fDNA1.1)

FashionDNA (fDNA) is the pipeline component that provides a fixed numerical article encoding. For convenience, we denote the previous version used in [20] as FashionDNA 1.0, and now describe the current module, FashionDNA 1.1.

The wealth of curated fashion article data owned by Zalando enables us to find meaningful item representations using deep learning methods. For the present work, we employ article embeddings of dimension \(d = 128\) (as opposed to 1536 previously), extracted as hidden activations of a fully convolutional neural network with residual layers. Images of products presented on a white background (“title” images) serve as input, and the network is tasked with predicting thousands of binary labels (tags) describing the article.

3.1 Data

FashionDNA is meant to be of general purpose for Zalando and uses a very large set of \(\sim \)2.9 M items. For these products, we retrieve high-resolution title images, of the kind shown in Fig. 1a, typically in an upright \(762\times 1100\) pixel format, although exact dimensions can vary. These images are downscaled to the largest size that fits a canvas of dimension \(165\times 245\), preserving their aspect ratio, and placed in its centre. We pad images with white background as necessary.
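A minimal sketch of this preprocessing step is given below; the use of Pillow is our choice for illustration and is not prescribed by the text.

```python
from PIL import Image

def preprocess_title_image(path, canvas_size=(165, 245)):
    """Downscale a title image to the largest size that fits a 165x245 canvas,
    preserving the aspect ratio, and centre it on a white background."""
    img = Image.open(path).convert("RGB")
    cw, ch = canvas_size
    scale = min(cw / img.width, ch / img.height)            # biggest size that still fits
    new_size = (round(img.width * scale), round(img.height * scale))
    img = img.resize(new_size, Image.BILINEAR)
    canvas = Image.new("RGB", canvas_size, (255, 255, 255))  # white padding
    offset = ((cw - new_size[0]) // 2, (ch - new_size[1]) // 2)
    canvas.paste(img, offset)
    return canvas
```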

As labels, we use curated item attributes that are assigned by fashion experts at the time of image production. Attributes include general properties (like brand and silhouette, describing the functional article category), as well as tags assigned for specific product groups (e.g., neckline tags for the shirt silhouette). They vary widely in frequency, consistency, and visibility on images. We roll out all existing attribute options into a sequence of boolean labels of equal status. For fDNA 1.1, we obtain 11,668 such labels, as opposed to 6,092 for FashionDNA 1.0. These 11,668 labels include more than 6,000 distinct brands, and typically 5–25 labels are assigned per article. Their distribution has long-tail characteristics: most labels occur only rarely. The Shannon entropy quantifies the information contained in the label distribution alone, disregarding the images:

$$\begin{aligned} \mathcal{H}_S \,=\, -\sum _{\lambda = 1}^L q_\lambda \log q_\lambda \,\approx \, 53.5 \;, \end{aligned}$$
(1)

where \(q_\lambda \) is the frequency of label \(\lambda \) among items. This entropy serves as a reference point in training our image-to-label classification model.
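A small sketch of how \(\mathcal{H}_S\) in Eq. (1) can be computed; the binary item–label matrix layout is an assumption for illustration.

```python
import numpy as np

def label_shannon_entropy(y):
    """Shannon entropy of the label distribution, Eq. (1).

    y: binary label matrix of shape (num_items, num_labels)."""
    q = y.mean(axis=0)                    # frequency q_lambda of each label
    q = q[q > 0]                          # labels that never occur contribute nothing
    return float(-np.sum(q * np.log(q)))  # natural logarithm, as in Eq. (1)
```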

3.2 The Model

Architecture. For fDNA 1.1, we employ a purpose-built, fully convolutional residual network [10] with batch normalization and a bottleneck layer for fDNA extraction, trained using TensorFlow. For architecture details we refer to Tables 1a and 1b. We note that the current model is much deeper than the network used in [20], which was based on the AlexNet model [19], and provides embeddings of superior quality with fewer parameters and fewer dimensions (128 instead of 1536).

Table 1. Architecture of FashionDNA 1.1.

Loss Function. We train our network to minimize the cumulative cross-entropy loss, averaged over training set items. To estimate it, we compare the predicted probability \(p_{k \lambda }\) for the kth article to carry label \(\lambda \) in a minibatch of \(K = 128\) with the corresponding ground truth \(y_{k \lambda } \in \{ 0, 1 \}\), and sum over all labels:

$$\begin{aligned} \mathcal{L}_\mathrm {mini} \,=\, - \frac{1}{K} \sum _{k = 1}^K \sum _{\lambda = 1}^L \bigl ( y_{k \lambda } \log p_{k \lambda } + (1 - y_{k \lambda }) \log (1 - p_{k \lambda }) \bigr ) \;. \end{aligned}$$
(2)

Note that all binary labels carry equal weight in our model.
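Equation (2) is a per-item sum of binary cross-entropies, averaged over the minibatch; a sketch in TensorFlow is given below (formulating it on raw logits for numerical stability is our choice).

```python
import tensorflow as tf

def multilabel_cross_entropy(y_true, logits):
    """Cumulative cross-entropy of Eq. (2).

    y_true: (K, L) float tensor of binary ground-truth labels y_{k,lambda}.
    logits: (K, L) raw scores; sigmoid(logits) gives the probabilities p_{k,lambda}."""
    per_label = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=logits)
    return tf.reduce_mean(tf.reduce_sum(per_label, axis=1))  # sum over labels, mean over the K items
```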

Training Procedure. We split the articles into training (\(\sim \)2.8M items, 96%) and test sets (\(\sim \)110k items). Unlike some embedding models for fashion images [1, 28] that build on pre-trained, open-sourced models, we train our network from scratch. We use Glorot initialization for all layers, except for the logistic-layer bias, which we pre-populate with the inverse sigmoid \(\sigma ^{-1}(q_\lambda )\) of the observed label frequencies, so that the initial network loss (2) matches the Shannon entropy \(\mathcal{H}_S\) (1). The model is trained for 11 epochs using AdaGrad as the optimiser.
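The bias pre-population can be sketched as follows; clipping the frequencies is our addition, to avoid infinite logits for labels that never occur in the training set.

```python
import numpy as np

def initial_logit_bias(y, eps=1e-6):
    """Pre-populate the logistic-layer bias with sigma^{-1}(q_lambda), so that the
    untrained network predicts the observed label frequencies.

    y: (num_items, num_labels) binary label matrix."""
    q = np.clip(y.mean(axis=0), eps, 1.0 - eps)  # clipping added to avoid infinite logits
    return np.log(q / (1.0 - q))                 # inverse sigmoid (logit) of q_lambda
```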

3.3 Results

We reach a loss of \(\mathcal{L} \approx 19.5\), as opposed to the initial 53.5; in comparison, FashionDNA 1.0 only reached 22.5. We use the test data to assess FashionDNA 1.1 on the task of predicting the product categories of interest to this manuscript: blazer, dress, pullover, shirt, skirt, t-shirt/top, trouser. As a metric we use the area under the ROC curve, which is 0.5 for a random guess and 1 for a perfect model. For all the categories of interest, the area under the curve is above 0.99. These results are very encouraging, as fDNA1.1 is clearly able to encode much of the information needed for visual search.

Evaluating the ability of a fashion item embedding to create meaningful neighbourhoods is conceptually difficult, as the notion of product similarity is in the eye of the observer. Our model has the capacity to use many different modes of similarity (like silhouette, colour, function, material, etc.) simultaneously to achieve the match between two quite different presentations of the same article. Sampling nearest neighbours in fDNA space indeed hints at such multi-modal behaviour (Fig. 3): a pair of snow pants is mostly matched by function, a dress by colour and embellishment (lace inserts).

Fig. 3. Nearest neighbours of sample articles (left column) among a test set of 50k clothing items in the FashionDNA embedding space.

4 Street2Fashion (Segmentation)

Street2Fashion is the component of the pipeline responsible for segmenting out the background, i.e. everything that is not a person or a fashion item.

4.1 Data

Street2Fashion is trained using a mixture of publicly available data and data from our shop (both from the catalogue and from street shots). Our complete dataset consists of:

  • 19,554 fashion images, mostly from Chictopia10K [21, 22] and Fashionista [22, 35], where various categories of garments, together with hair and skin, are segmented separately. We do not use the category-specific segmentations; instead, we combine them and treat them all as foreground.

  • 200 studio model images from our catalogue, segmented by us.

  • 90 model images from our street shots, segmented by us.

We cannot release our own images; however, most of the results can be reproduced using the public Chictopia10K and Fashionista datasets.

In addition, images are slightly altered to increase the robustness of the segmentation model. As described in Appendix Sect. A, the set of transformations is {“none”, “translate”, “rotate”, “zoom in”, “zoom out”, “blur”, “noise”, “darken”, “brighten”}. This set is doubled by applying a horizontal flip (“fliph”) to the image first. All images undergo all transformations, with the parameters of each transformation randomly sampled from an acceptable range, thereby inflating the dataset by a factor of 18. Images are then resized to 285 \(\times \) 189. Typical images from each source are shown, after transformation, in Fig. 5 in Sect. 4.3. A sketch of this augmentation scheme is given below.
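The sketch below illustrates the scheme with Pillow; the parameter ranges are illustrative only (the actual ranges are given in Appendix Sect. A), and the “noise” transformation is omitted for brevity.

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

TRANSFORMS = ["none", "translate", "rotate", "zoom in", "zoom out",
              "blur", "noise", "darken", "brighten"]

def apply_transform(img, name):
    """Apply one named transformation with randomly sampled parameters.
    Parameter ranges here are illustrative; "noise" is left as a no-op."""
    w, h = img.size
    if name == "translate":
        dx, dy = random.randint(-15, 15), random.randint(-15, 15)
        return img.transform((w, h), Image.AFFINE, (1, 0, dx, 0, 1, dy), fillcolor="white")
    if name == "rotate":
        return img.rotate(random.uniform(-10, 10), fillcolor="white")
    if name in ("zoom in", "zoom out"):
        f = random.uniform(1.05, 1.2) if name == "zoom in" else random.uniform(0.8, 0.95)
        resized = img.resize((int(w * f), int(h * f)))
        canvas = Image.new("RGB", (w, h), "white")
        canvas.paste(resized, ((w - resized.width) // 2, (h - resized.height) // 2))
        return canvas
    if name == "blur":
        return img.filter(ImageFilter.GaussianBlur(random.uniform(0.5, 2.0)))
    if name == "darken":
        return ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 0.9))
    if name == "brighten":
        return ImageEnhance.Brightness(img).enhance(random.uniform(1.1, 1.4))
    return img  # "none" and "noise"

def augment_all(img):
    """The 18 variants of one image: the 9 transformations, each with and
    without a preceding horizontal flip ("fliph")."""
    variants = []
    for flip in (False, True):
        base = img.transpose(Image.FLIP_LEFT_RIGHT) if flip else img
        variants.extend(apply_transform(base, name) for name in TRANSFORMS)
    return variants
```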

4.2 Model

Architecture. Our segmentation model follows the idea of a U-net architecture [26], as given in Table 2; a minimal sketch follows the table. The input is an image of size (285, 189, 3) whose values are divided by 255 to lie between 0 and 1. The output is an image in which the person has been identified and the original background is replaced by white; its values also lie between 0 and 1.

Table 2. Architecture of Street2Fashion.
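The sketch below is a minimal U-net-style encoder–decoder in Keras, not the exact architecture of Table 2. The input size is rounded to 288 \(\times \) 192 so that two pooling/upsampling stages divide evenly, and compositing a predicted soft mask over white is one way (our assumption) to realise the output described above.

```python
from tensorflow.keras import layers, Model

def build_street2fashion(input_shape=(288, 192, 3)):
    """Minimal U-net-style sketch; inputs are assumed already scaled to [0, 1]."""
    inp = layers.Input(shape=input_shape)

    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    u2 = layers.Concatenate()([layers.UpSampling2D()(b), c2])   # skip connection
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(u2)
    u1 = layers.Concatenate()([layers.UpSampling2D()(c3), c1])  # skip connection
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(u1)

    mask = layers.Conv2D(1, 1, activation="sigmoid", name="soft_mask")(c4)
    segmented = mask * inp + (1.0 - mask)   # background pushed towards white (= 1.0)
    return Model(inp, segmented)
```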

Backward Pass. The labels or targets are the original images where the background has been replaced with white pixels, and the loss is the mean squared error between the corresponding pixels:

$$ \mathcal {L} = \frac{1}{N}\sum _{i=1}^{N}{\left( \frac{1}{J}\sum _{j=1}^{J}{\left( v_{ij} - l_{ij}\right) ^2}\right) } $$

where N is the number of images in the batch, J the number of elements per image (285 \(\times \) 189 \(\times \) 3 = 161595 elements), \(v_{ij}\) the predicted value for the \(j^{\text {th}}\) element of the \(i^{\text {th}}\) image, and \(l_{ij}\) the ground-truth value of that element. We also considered a binary cross-entropy loss directly on the mask itself, but it changed neither the global performance nor the visual quality of the segmentation.

4.3 Results

Experimental Set-up. We randomly split the dataset, keeping 80% for training and setting the rest aside for testing. The optimiser is Adam, with a learning rate of 0.0001 and other parameters set to their default values; the batch size is 64. Our internal images, due to their much lower representation, are upweighted by a factor of 10 in the loss. Performance is measured using two metrics: mean squared error (mse) and accuracy. Both metrics are computed at the pixel level using the soft mask predicted by the model of interest. The mse compares the true segmented image to the soft segmented image obtained from the soft mask:

$$ \text {mse(image)} = \frac{1}{J}\sum _{j=1}^{J}{\left( v_j - l_j\right) ^2} $$

where J is the number of elements per image (285 \(\times \) 189 \(\times \) 3 = 161595 elements), \(v_j\) the predicted value for the \(j^{\text {th}}\) element of the image, and \(l_j\) the ground-truth value of that element. The accuracy compares the true mask to the hard mask obtained from the soft mask (the soft values higher than or equal to 0.5 become 1, the others 0):

$$ \text {accuracy(image)} = \frac{1}{J}\sum _{j=1}^{J}{\mathbbm {1}_{\left( q_j = m_j\right) }} $$

where \(q_j\) is the predicted hard value for the \(j^{\text {th}}\) element of the mask, and \(m_j\) the ground-truth value of that element. We want low mse and high accuracy.
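A sketch of both per-image metrics is given below, assuming (our reading) that the soft segmented image is obtained by compositing the original image over white with the soft mask.

```python
import numpy as np

def segmentation_metrics(image, soft_mask, true_mask):
    """Per-image mse and accuracy as defined above.

    image:     (H, W, 3) original image, values in [0, 1]
    soft_mask: (H, W, 1) predicted soft mask, values in [0, 1]
    true_mask: (H, W, 1) ground-truth binary mask (1 = foreground)."""
    # Soft segmented image vs. true segmented image (background set to white).
    soft_seg = soft_mask * image + (1.0 - soft_mask)
    true_seg = true_mask * image + (1.0 - true_mask)
    mse = float(np.mean((soft_seg - true_seg) ** 2))

    # Hard mask vs. true mask: soft values >= 0.5 become 1, the others 0.
    hard_mask = (soft_mask >= 0.5).astype(true_mask.dtype)
    accuracy = float(np.mean(hard_mask == true_mask))
    return mse, accuracy
```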

Comparison with Mask-RCNN [9]. We run the publicly available Keras implementation of Mask-RCNN, a state-of-the-art model for multi-class segmentation, on our test images. We keep the image size and the set of images the same as above. For Mask-RCNN, we consider as fashion the categories {person, backpack, umbrella, handbag, tie, suitcase}, as they would all be labelled as foreground in our ground-truth data. Figure 4 shows the distributions of these metrics at the image level (averaged over all pixels in an image), in green for Mask-RCNN and in orange for our model. In each boxplot, the black horizontal middle line represents the median, while the red line represents the mean.

Fig. 4. Distribution of metrics at the image level, in green for Mask-RCNN [9], in orange for Street2Fashion. In each boxplot, the black horizontal middle line represents the median, while the red line represents the mean. (Color figure online)

Street2Fashion performs significantly better on these fashion images than Mask-RCNN despite being much simpler; however, this statement should be interpreted very carefully. Street2Fashion has trained on the same distribution of images as the test images and is tuned to solve this task, but it can only process people and fashion accessories such as umbrellas or bags, it only knows these categories as “foreground”, and it does not generalise to crowds. In contrast, Mask-RCNN is more powerful: it handles many more categories and can also deal with several instances of each category. Our use case, however, is specifically about such fashion images, which implies that we need neither this generality nor multiple instances. We could fine-tune Mask-RCNN to our task and benefit from its flexibility, but because it is much simpler, Street2Fashion is also much faster, which is important at test time. For a batch size of 16 images, without any image pre-processing or result post-processing involved, applying the Keras method predict_on_batch on the same GPU takes 0.04 s for Street2Fashion and 5.25 s for Mask-RCNN (these numbers are averaged over 500 batches and are only a rough estimate).

Examples of Segmentations. Figure 5 shows random examples of test segmentations for the four image sources used in training. Generally the results are encouraging, and even if they are not always perfect, they are good enough to retain most of the clothing while discarding most of the background.

5 Fashion2Shop (Product Matching)

Fashion2Shop is the component of the pipeline responsible for matching query images with white backgrounds to relevant products. It uses the same dataset and the same architecture as Studio2Shop [20]; however, there are a few key differences:

  • FashionDNA 1.1 is of better quality than FashionDNA 1.0 [20], see Sect. 3. In addition, its dimensionality is much smaller (128 against 1536), so unlike in [20] no dimensionality reduction method is needed.

  • The training uses model images segmented by Street2Fashion, see Sect. 4.

  • In order to be more robust to unknown types of images, the image transformations described in Appendix Sect. A are used during training. The dataset is not inflated; instead, a transformation and its parameters are sampled randomly for each image of each batch.

5.1 Data [20]

Catalogue Images. Zalando has a catalogue of millions of fashion products, each of which is described by a set of studio images. For historical and practical reasons, we focus on 7 types of articles for the visual search task: dress, blazer, pullover, t-shirt/top, shirt, skirt and trousers; however, the approach put forward is not formally restricted to these categories and can be extended in a straightforward manner.

Figure 1 shows an example of our catalogue images for a black t-shirt. All our images follow a standard format (though it may change over time) with two main types:

  • The title image, which shows a front view of the article on an (invisible) hanger, see Fig. 1a.

  • (studio) model images, which show various views of the article being worn by a model against a clean background, and can have three different formats:

    • Full-body: these images show the full person and usually display more than one article, see Fig. 1b. There can be occlusions, see Fig. 1c.

    • Half-body: these images focus on the article, see Fig. 1d.

    • Detail: these images focus on a detail of the article (say the zipper of a cardigan) and show few to no body parts, see Fig. 1e. These “detail” images are challenging because it is often very hard, even for humans, to tell what kind of product is shown.

Fig. 5. Examples of segmentation results on test images.

As in [20], the product is represented by FashionDNA, a numerical vector learnt separately on title images. Unlike [20], however, we do not need to apply PCA to reduce dimensionality, as fDNA1.1 vectors are already of dimension 128, so no information is lost. Indeed, as stressed in Sect. 3, the product representation in this manuscript is quite different from [20], both in terms of model and quality. Having this representation is a strong asset, but in order to assess the generalising capacity of our model properly, title images should not be used as query images. This can change for a model used in production, however, and should not constitute a limitation.

Model images are all considered query images. We take the first 4 model images per article, or all of them if there are fewer than 4. This is enough to capture most full/half-body images and discard as many detail images as possible. In total, in this manuscript, we use 246,961 products and 957,535 model images for training, and 50,000 products and 20,000 images for testing.

Annotations. In visual search we are interested in retrieving products relevant to a query image. For training, we need examples of such relevant products, which we easily get for the studio model images from our catalogue. Several products are usually visible on model images; however, usually only the product for which the image was shot is entered into the system. As a result, most of our model images are annotated with this one product of interest only.

For a (growing) minority of the full-body images, however, more than one product is entered into the system. In our dataset, this minority consists of 126,713 images (about 13%). This subset of well-annotated images allows us to find several types of products in most full-body images. The maximum number of annotated products is 4, but most of these images (about 80%) only have 1. Note that we only keep the annotations that belong to the 7 types of articles of interest. These multiple annotations are precious but not perfect. Many of the extra products in these images overlap, i.e. some products are re-used for different shots, some more than 100 or even 700 times. A typical case is a pair of black skinny jeans that fits virtually every t-shirt/top/blouse: models will change their top for different shots but keep the jeans for efficiency. Another example is a plain t-shirt or tank top, often worn under a shirt and therefore not visible, which creates inconsistent annotations.

Table 3. Architecture of Fashion2Shop.

5.2 Model

Architecture [20]. The architecture is the same as in [20] and given in Table 3. It has a left leg (or query leg), whose input is referred to as query_input in the table, and a short static right leg (or product leg), whose input is referred to as article_fDNA in the table.

Training. The training procedure is as described in [20]. The query leg is fed with model images, the product leg with the corresponding fDNA vectors. Model images are resized to 224 \(\times \) 155, transformed randomly using one of the transformations in Appendix Sect. A, and segmented with Street2Fashion. Each query image is run against 50 products: we first take all the annotated products, which constitute our positive articles, and we complete the 50 slots with randomly sampled products, which constitute our negative articles (see the sketch below). The positives are always the same, but the negatives are sampled randomly for each batch and for each epoch. For a mini-batch size of N, we therefore get \(50\times N\) matches x, some positive with label \(y=1\), some negative with label \(y=0\). The optimiser is Adam, with a learning rate of 0.0001 and other parameters set to their default values; the batch size is 64.
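The candidate sampling per query image can be sketched as follows; excluding the positives from the negative pool is our assumption, not stated above.

```python
import numpy as np

def sample_candidates(positive_ids, catalogue_ids, n_slots=50, rng=np.random):
    """Build the 50 candidate products for one query image: all annotated
    (positive) products first, then randomly sampled negatives to fill the
    remaining slots. Negatives are re-sampled for every batch and epoch."""
    positives = list(positive_ids)
    n_neg = n_slots - len(positives)
    pool = np.setdiff1d(catalogue_ids, positives)   # assumption: never sample a positive as negative
    negatives = rng.choice(pool, size=n_neg, replace=False)
    candidates = np.concatenate([positives, negatives])
    labels = np.concatenate([np.ones(len(positives)), np.zeros(n_neg)])  # y=1 match, y=0 no match
    return candidates, labels
```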

Backward Pass. We use a cross-entropy loss:

$$ \mathcal {L} = -\sum _{i=1}^N{\sum _{j=1}^{50}{\left( y_{ij}\log {p\left( I_i, x_{ij}\right) } + \left( 1-y_{ij}\right) \log {\left( 1-p(I_i, x_{ij})\right) }\right) }} $$

where \(I_i\) is the \(i^{th}\) model image, \(x_{ij}\) the \(j^{th}\) product for that image, \(p\left( I_i, x_{ij}\right) \) the probability given by the model for the match between image \(I_i\) and product \(x_{ij}\), and \(y_{ij}\) the actual label of that match (1 or 0).
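A direct transcription of this loss in TensorFlow is sketched below; the probability clipping is our addition for numerical safety.

```python
import tensorflow as tf

def matching_loss(y_true, probs, eps=1e-7):
    """Cross-entropy over the N x 50 (query image, product) matches.

    y_true: (N, 50) float tensor of labels y_ij in {0, 1}.
    probs:  (N, 50) tensor of probabilities p(I_i, x_ij)."""
    probs = tf.clip_by_value(probs, eps, 1.0 - eps)  # clipping added for numerical safety
    bce = y_true * tf.math.log(probs) + (1.0 - y_true) * tf.math.log(1.0 - probs)
    return -tf.reduce_sum(bce)
```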

Testing. At test time, each test query image is matched against a gallery of previously unseen products, and these products are then ranked by decreasing matching probability. The plots in this manuscript typically show the top 50 suggested products for each image.

5.3 Competing Architectures [20]

In [20], Studio2Shop is tested alongside different variants of its architecture, summarised below. For more detailed information, please refer to the original manuscript.

  • fDNA1.0-ranking-loss (formerly referred to as fDNA-ranking-loss) uses a simple dot product to order the (query image, product) pairs, and a sigmoid ranking loss.

  • fDNA1.0-linear (formerly referred to as fDNA-linear) uses a simple dot product to order the (query image, product) pairs, and a cross-entropy loss.

  • Studio2Shop with fc14 (formerly referred to as fc14-non-linear) has the same architecture and loss as Studio2Shop but uses VGG features for products instead of fDNA1.0.

  • Studio2Shop with 128floats (formerly referred to as 128floats-non-linear) has the same architecture and loss as Studio2Shop but uses 128-floats features for products instead of fDNA1.0.

5.4 Results

The data and the architecture of Fashion2Shop are the same as in [20], so we can directly assess the contribution of FashionDNA 1.1 and Street2Fashion.

Experimental Set-up. For ease of comparison, we follow the same procedure as described in [20] and run tests on 20,000 randomly sampled unseen test query images against 50,000 unseen test products. For each test query image, all 50,000 possible (image, product) pairs are submitted to the model and are ranked by decreasing score.

Performance. Table 4 shows performance results for the various models and compares them to [20]. We assess our models using top-k retrieval, which measures the proportion of query images for which a correct product is found in the top k suggestions. The top-1% measure (which here means top-500) is given for easier comparison in case the gallery of a different dataset has a different size. The average metric gives the average index of retrieval: an average of 5 means that a correct product is found on average at position 5. Because the distribution of retrieval indices is typically heavy-tailed, we add the median metric, which gives the median index of retrieval: a median of 5 means that for half the images, a correct product is found at position 5 or better. All our models are assessed on the exact same (query image, product) pairs. A sketch of these retrieval metrics is given after Table 4.

Table 4. Results of the retrieval test using 20,000 query images against 50,000 Zalando articles. Top-k indicates the proportion of query images for which the correct article was found at position lower than or equal to k. Average and median refer respectively to the average and median position at which an article is retrieved. The best performance is shown in bold.
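The retrieval metrics can be computed from the best rank of a correct product per query image; a minimal sketch is given below (extracting these ranks from the score matrix is assumed to have been done already).

```python
import numpy as np

def retrieval_metrics(ranks, gallery_size, ks=(1, 5, 10, 20, 50)):
    """Summarise a retrieval test.

    ranks: array of shape (num_queries,) holding, for each query image, the best
           position (1 = top) at which a correct product is retrieved."""
    ranks = np.asarray(ranks)
    metrics = {f"top-{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["top-1%"] = float(np.mean(ranks <= gallery_size // 100))  # e.g. top-500 for 50,000 products
    metrics["average"] = float(np.mean(ranks))
    metrics["median"] = float(np.median(ranks))
    return metrics
```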

Generally it was found in [20] that:

  • Studio2Shop outperforms other architectures, mostly thanks to the non-linear matching module.

  • fDNA1.0 can be replaced with any product representation such as fc14 (VGG16 features), but having a specialised feature representation makes a significant difference.

  • The time needed for (naive) retrieval is too long for real-time applications; however, it could be greatly shortened by pre-filtering the candidate products using a fast linear model and by optimising the implementation.

Note that FashionDNA 1.1 leads to a significant boost in the performance of Studio2Shop. In contrast, the segmentation and the image transformations seem not to have made any further difference. This is unsurprising, as the model cannot really perform better on our catalogue images simply because the background is white instead of neutral. Adding transformations could even make the task harder in principle, so it is rather reassuring that the performance is stable. It means that Street2Fashion works well enough on our studio images, despite the very few such training instances it has seen. We will see the positive effect of the segmentation and image transformations on external datasets in Sect. 6.

Fig. 6. Random examples of the retrieval test using 20,000 queries against 50,000 Zalando articles. Query images and their segmented versions are in the left columns, next to two rows displaying the top 50 suggested products, in western reading order. Green boxes show exact hits. (Color figure online)

Table 5. Quantitative results on external datasets. Top-k retrieval indicates the proportion of query images for which a correct product was found at position lower than or equal to k. Average stands for the average retrieval position, median for the median retrieval position.

Retrieval. Figure 6 shows random examples of retrievals on test query images. The query image and its segmented version are shown on the left, while the top 50 suggestions are displayed on the right in western reading order. A few observations can be made. Firstly, we usually find one correct article very high up in the suggestions. Secondly, even if the correct article is not found, the top suggestions respect the style and are almost always relevant, which is the most important result as, realistically, the correct product will likely not be part of our assortment at query time. Thirdly, we are able to retrieve more than one category: if the image shows a full body, we usually find a mixture of tops and bottoms among the top suggested products. A more customer-friendly interface could exploit this to present the results in a more pleasing way.

6 Experiments on External Datasets

Most academic contributions whose data is publicly available focus on retrieving images which potentially contain models (and even backgrounds) on both sides, the query side and the product side. Practically speaking, a product can be represented by a variety of pictures, whereas in this manuscript the product is represented by FashionDNA based on title images only. Consequently, it is very difficult to compare our work to others, because most models are not reproducible, and most datasets do not fit our requirement, i.e. do not have isolated title images.

6.1 The Datasets

We use two external datasets.

  • DeepFashion In-Shop-Retrieval [25]. The original study using DeepFashion [25] does not do any domain transfer, i.e. both query images and product images are of the same type; however, the product images that are of the title kind can easily be isolated. We reduce the product set to those images, keeping only those from the DeepFashion categories that are closest to ours (Denim, Jackets_Vests, Pants, Shirts_Polos, Shorts, Sweaters, Sweatshirts_Hoodies, Tees_Tanks, Blouses_Shirts, Cardigans, Dresses, Graphic_Tees, Jackets_Coats, Skirts), leaving 683 products. The query set is then restricted to the images whose product is kept, i.e. 2,922 images. We contacted the group to ask them to run their model on this reduced dataset so we could compare ourselves to them fairly; they agreed to help but never sent the results, even after several reminders on our part.

  • LookBook [38]. LookBook is a dataset that was put together for the task of morphing a query image into the title image of the product it contains, using generative adversarial networks. There are 68,820 query images and 8,726 products. The product categories are unknown, so we keep all of them. Though the data is appealing because the product images are title images and the query images are wild images, here again we cannot compare ourselves to the original study, as it does not address visual search.

6.2 Experimental Set-up

We apply FashionDNA 1.0 and 1.1 to the title images of these external product sets, without fine-tuning, to obtain the representations of the external products. For each dataset, we then apply Street2Fashion2Shop, again without fine-tuning, to the query images and match them against (a) the products coming from the dataset itself, for quantitative assessment, and (b) the Zalando products used in Sect. 5.4, for qualitative assessment.

6.3 Quantitative Results

Table 5 shows quantitative results for DeepFashion In-Shop-Retrieval (see Table 5a) and LookBook (see Table 5b). Top-k retrieval indicates the proportion of query images for which a correct product was found at position lower than or equal to k. Average stands for the average retrieval position, median for the median retrieval position. No fine-tuning is involved.

The segmentation gives a much greater boost to the performance on LookBook images than on DeepFashion In-Shop-Retrieval images. This is to be expected as the DeepFashion images are clean and therefore do not benefit so much from the segmentation, while LookBook images are wild.

Fig. 7. Qualitative results on external datasets. For each query image, the query image is displayed on the very left, followed by the segmented image and by the top 50 product suggestions. Better viewed with a zoom.

6.4 Qualitative Results

Figure 7 shows random examples of the qualitative experiments, for DeepFashion In-Shop-Retrieval (see Fig. 7a), LookBook (see Fig. 7b) and for street shots (see Fig. 7c). For each query image, the query image is displayed on the very left, followed by the segmented image and by the top 50 product suggestions. The reader may want to zoom into the figures to see the results better.

Without fine-tuning the model to this kind of data, the results are already pleasing. The functional category of the garments and the colour family are respected, which should be a minimum. The style usually follows too. For example, in LookBook’s last example, the dress is not only blue but denim-like, and so are most suggestions. In the first street shot, the man is wearing a sports outfit, and our model suggests sports products. Additionally, more than one category is naturally returned when they are fully visible, which, if presented properly to customers, could be a nice feature. Note as well that in most cases (not all), although there is no specific garment detection, the model is not confused about which colour to assign to which garment: it has developed an internal intuition for where to find what, a prior on where garments are usually located, so to say. An interesting example of this is the last street shot: our model does not know about scarves and there are no scarves in the assortment it could retrieve, but the scarf the man is wearing is located where a shirt would be, and our model suggests shirts with a similar pattern.

7 Conclusions and Future Work

We have presented Street2Fashion2Shop, a pipeline that enables us to train a visual search model using ill-suited but abundant annotated data, namely the studio images of our catalogue. The pipeline has three steps: a feature vector is obtained for each product by applying FashionDNA to its title image, the query image is segmented using Street2Fashion, and then the segmented query image and the products are matched using Fashion2Shop.

Street2Fashion2Shop has the advantage of using a powerful product representation that most companies develop for multiple purposes, and of matching products to wild images even though it has no labelled wild data.

Much work remains to be done. The segmentation model Street2Fashion could be improved in several ways, for example:

  • It could be made fully convolutional to deal with images of various sizes.

  • It is already able to deal with different scales thanks to the image transformations applied during training, but it is not quite scale-invariant. We do not see much use in analysing shots taken from afar, as this is not really a realistic use case, but we do have problems with close shots. Gathering more corresponding training data would be key.

  • Each mask pixel is independent of the others in the loss. We could use a loss that enforces smoothness, for example by using regularisers on local regions. Alternatively, there has been work on using RNNs to model conditional random fields over the entire image [39].

We can also think of a few directions to improve the (image, product) matching model Fashion2Shop:

  • It can be made fully convolutional to deal with images of various sizes.

  • FashionDNA can be fine-tuned in the hope of needing only a simple linear match, instead of the current non-linear match which is much slower.

  • The architecture of the query leg can be made more elegant and powerful, for example using ResNets as in FashionDNA 1.1.

All the small improvements mentioned above may bring performance gains, but they remain incremental efforts. A significant change we would like to make is not so much about performance as about flexibility. So far, a query image has to be matched against the whole assortment to find the most suitable products, which is not efficient. This can be made much faster by reducing the set of candidate products, using a linear model to identify the most promising candidates and exploiting the structure of FashionDNA. Alternatively, we would like to generate, from a query image, an appropriate distribution over FashionDNA. Sampling from this distribution would allow us to generate a set of relevant feature vectors that could then be matched to products using fast nearest-neighbour search.