
Neurocomputing

Volume 395, 28 June 2020, Pages 237-246

Learning fashion compatibility across categories with deep multimodal neural networks

https://doi.org/10.1016/j.neucom.2018.06.098

Abstract

Fashion compatibility is a subjective human sense of the relationships between fashion items, and it is essential for fashion recommendation. Recently, it has attracted increasing attention and become a popular research topic. Learning fashion compatibility is a challenging task, since it must consider many factors of fashion items, such as color, texture, style, and functionality. Unlike low-level visual compatibility (e.g., color, texture), high-level semantic compatibility (e.g., style, functionality) cannot be handled purely on the basis of fashion images. In this paper, we propose a novel multimodal framework to learn fashion compatibility, which simultaneously integrates semantic and visual embeddings into a unified deep learning model. For semantic embeddings, a multilayered Long Short-Term Memory (LSTM) network is employed to learn discriminative semantic representations, while a deep Convolutional Neural Network (CNN) is used for visual embeddings. A fusion module is then constructed to combine the semantic and visual information of fashion items, which equivalently transforms the semantic and visual spaces into a latent feature space. Furthermore, a new triplet ranking loss with compatible weights is introduced to measure fine-grained relationships between fashion items, which is more consistent with human perception of fashion compatibility in practice. Extensive experiments conducted on the Amazon fashion dataset demonstrate the effectiveness of the proposed method for learning fashion compatibility, which outperforms state-of-the-art approaches.

Introduction

With the booming of fashion websites such as Pinterest and Polyvore, fashion studies [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] have attracted increasing attention from researchers. Learning fashion compatibility is an important branch of this topic, since it is useful for recommendation systems. Fashion compatibility is essentially a human sense that identifies and understands relationships between fashion items, and it exhibits substantial heterogeneity. Generally, it can be measured from certain aspects of fashion items, such as low-level visual compatibility (e.g., color and texture) and high-level semantic compatibility (e.g., style and functionality). Therefore, learning fashion compatibility has to take into account multiple factors from heterogeneous data, which makes it highly challenging.

Recently, a few works [25], [26] have studied fashion compatibility, treating it as a metric learning problem. First, a large dataset containing compatible and incompatible items is collected. A parameterized distance function is then defined and trained so that compatible items become closer than incompatible ones. Such metric learning methods are highly flexible and work well to some extent. Unfortunately, these methods are based solely on fashion images, so they only consider visual compatibility between fashion items. Although visual content is a crucial clue for capturing low-level visual compatibility between fashion items, it is hard to use it to measure high-level compatibility, such as style compatibility. Style recognition for fashion items is still an unsolved task in the computer vision community. Moreover, the label between a pair of fashion items in existing works is binary: compatible or incompatible. However, there is a wide range of situations in which a pair of fashion items lies between compatible and incompatible. Therefore, such hard quantification is insufficient to represent human feelings about fashion compatibility, which should be evaluated at a fine-grained level.

Motivated by the aforementioned observations, in this paper we model fashion compatibility by mining multimodal information about fashion items, i.e., fashion images and textual descriptions. As shown in Fig. 1, there are three pairs of compatible fashion items crawled from Polyvore. The two items in Fig. 1(a) are obviously compatible, since they share a similar style, including color and texture attributes. However, it is hard to explain why the items in Fig. 1(b) and (c) are compatible judging only from their visual appearance; the reason can be found in their textual descriptions. For instance, the description “sunproof” for the top item in Fig. 1(b) indicates its function, which is semantically consistent with the descriptions “vacation” and “beach” for the bottom item in Fig. 1(b). It can be seen that visual and textual information are inherently complementary to each other, so it is reasonable to measure fashion compatibility by combining them. In addition, the notion of a “compatible weight” is introduced for soft quantification of fashion compatibility: compatibility is rated by a range of values instead of a binary value, which measures the relationship between fashion items more accurately.

Similar to [25], [26], learning fashion compatibility is treated as a metric learning problem in our work. The most challenging issues are how to measure the compatibility between fashion items by combining their visual and textual information, and how to integrate compatible weights into the model. As shown in Fig. 2, an end-to-end deep learning framework, named the Visual-Semantic Fusion Model (VSFM), is proposed to address these issues; it learns a feature transformation from fashion images and texts into a latent feature space in which compatible fashion items are closer than incompatible ones. Visual and textual representations are produced by a deep Convolutional Neural Network (CNN) and a multilayered Long Short-Term Memory (LSTM) network, respectively. A linear embedding layer is then exploited to fuse the visual and textual information. On top of this model, a novel triplet ranking loss layer with compatible weights is constructed, which captures fine-grained compatibility between fashion items. Finally, the offline-trained model can be used directly to calculate the distance between pairs of fashion items; the only online processing is to choose the most compatible pair of fashion items according to their distances.
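To make this architecture concrete, the following is a minimal PyTorch-style sketch of such a visual-semantic fusion network. The ResNet-50 backbone, embedding sizes, and layer counts are illustrative assumptions rather than the exact configuration used in the paper.

import torch
import torch.nn as nn
import torchvision.models as models

class VisualSemanticFusion(nn.Module):
    """Sketch of a visual-semantic fusion network in the spirit of VSFM.

    A CNN embeds the item image, a multilayered LSTM embeds the textual
    description, and a linear fusion layer maps the concatenated features
    into a shared latent space. All dimensions and the ResNet-50 backbone
    are illustrative assumptions.
    """

    def __init__(self, vocab_size, embed_dim=300, lstm_hidden=512,
                 lstm_layers=2, latent_dim=512):
        super().__init__()
        # Visual branch: CNN backbone with its classifier head removed.
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.visual_fc = nn.Linear(2048, latent_dim)

        # Textual branch: word embeddings followed by a multilayered LSTM.
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, num_layers=lstm_layers,
                            batch_first=True)
        self.text_fc = nn.Linear(lstm_hidden, latent_dim)

        # Fusion: a single linear embedding over the concatenated features.
        self.fusion = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, image, text_ids):
        v = self.visual_fc(self.cnn(image).flatten(1))   # (B, latent_dim)
        w = self.word_embed(text_ids)                    # (B, T, embed_dim)
        _, (h, _) = self.lstm(w)                         # h: (layers, B, hidden)
        t = self.text_fc(h[-1])                          # last layer's hidden state
        return self.fusion(torch.cat([v, t], dim=1))     # latent embedding H

In this sketch, compatibility between two items can then be scored offline as the distance between their latent embeddings, matching the online processing described above.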

The contributions of this work are summarized as follows:

  • A novel end-to-end deep learning framework is constructed to learn fashion compatibility, which simultaneously integrates two kinds of heterogeneous data, visual and textual information, to form an efficient and effective fashion learning model.

  • The notion of compatible weights is proposed to describe fashion compatibility more accurately, and it is integrated into a triplet ranking loss layer to refine the whole deep learning framework.

  • To evaluate the performance of the proposed model, extensive experiments have been conducted on the Amazon dataset, which demonstrate that the proposed model achieves a substantial improvement over state-of-the-art approaches.

The rest of this paper is organized as follows: Related work is reviewed in Section 2. The proposed Visual-Semantic Fusion Model is elaborated in Section 3. Experiments and results are presented in Section 4. Finally, this paper is concluded with a summary in Section 5.

Section snippets

Related work

The branches most closely related to our work are fashion research and deep neural networks, which are discussed in the following subsections.

Problem formulation

Fashion compatibility is determined by the distance between fashion items, which is calculated by the offline-trained VSFM, as illustrated in Fig. 2. Given a pair of compatible fashion items Ip and Iq, where Ip ∈ Ci, Iq ∈ Cj, i ≠ j, represented by {Vp, Tp} and {Vq, Tq}, respectively, our goal is to learn a feature transformation f: Hp = f({Vp, Tp}) from the visual and textual spaces to a latent hybrid feature space. In this space, compatible fashion items Ip and Iq are close. Symbols
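As a rough illustration of how the transformation f and the compatible weights could enter a triplet ranking objective, the sketch below scales a standard triplet margin term by a per-triplet compatibility score. Both the weighting scheme and the margin value are assumptions for illustration; the paper's exact formulation is not reproduced here.

import torch.nn.functional as F

def weighted_triplet_loss(anchor, positive, negative, weight, margin=0.2):
    """Triplet ranking loss scaled by a per-triplet compatible weight.

    anchor, positive and negative are latent embeddings H produced by the
    fusion model; weight is a compatibility score in (0, 1]. The weighting
    scheme and margin shown here are illustrative assumptions.
    """
    d_pos = F.pairwise_distance(anchor, positive)   # distance to the compatible item
    d_neg = F.pairwise_distance(anchor, negative)   # distance to the incompatible item
    return (weight * F.relu(d_pos - d_neg + margin)).mean()

Under such an objective, triplets with a higher compatible weight contribute more strongly to pulling the compatible pair together, which is one way soft, fine-grained compatibility could be reflected during training.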

Dataset

There are several existing fashion datasets, including the outfit dataset [27], the Amazon clothing dataset [26], the eBay dataset [24], Magic Closet [23] and DeepFashion [9]. The Amazon clothing dataset [26] provides textual descriptions, images of fashion items, and records of user co-purchasing, which meets our requirements. Therefore, we select fashion items with the fields “title”, “imageURL”, “category: clothing” and “buy together” from the Amazon clothing dataset. A new dataset with 127,479 fashion
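A small sketch of this selection step is given below. The one-record-per-line JSON layout and the field names ("asin", "category", "title", "imageURL", "buy_together") are assumptions made purely for illustration; the actual Amazon metadata files may be organized differently.

import json

def load_fashion_items(metadata_path):
    """Collect clothing items that carry the fields named in the paper.

    The file layout and field names used here are illustrative assumptions,
    not the exact schema of the Amazon clothing dataset.
    """
    items = {}
    with open(metadata_path) as f:
        for line in f:
            record = json.loads(line)
            # Keep only clothing items that have a title, an image URL and
            # at least one co-purchase ("buy together") record.
            if ("Clothing" in str(record.get("category", ""))
                    and record.get("title")
                    and record.get("imageURL")
                    and record.get("buy_together")):
                items[record["asin"]] = {
                    "title": record["title"],
                    "imageURL": record["imageURL"],
                    "buy_together": record["buy_together"],
                }
    return items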

Conclusion

In this paper, a deep multimodal neural network is constructed for learning fashion compatibility. A deep CNN is built for visual embedding, and semantic embedding is constructed with a multilayered Long Short-Term Memory (LSTM) network. The visual and textual information is integrated with a fully-connected fusion module. Meanwhile, the notion of compatible weights is introduced and integrated into the triplet ranking loss, which refines the relationships between fashion items. To verify the proposed

Declaration of interest

None.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61772436), the Sichuan Science and Technology Innovation Seedling Fund (2017RZ0015, 2017020), Foundations for the Department of Transportation of Henan Province (2019J-2-2) and the Fundamental Research Funds for the Central Universities.


References (49)

  • S. Liu et al., Street-to-shop: cross-scenario clothing retrieval via parts alignment and auxiliary set, Proceedings of the CVPR (2012)

  • J. Huang et al., Cross-domain image retrieval with a dual attribute-aware ranking network, Proceedings of the ICCV (2015)

  • Z. Liu et al., DeepFashion: powering robust clothes recognition and retrieval with rich annotations, Proceedings of the CVPR (2016)

  • S. Jiang et al., Deep bi-directional cross-triplet embedding for cross-domain clothing retrieval, Proceedings of the ACM MM (2016)

  • B. Zhao et al., Memory-augmented attribute manipulation networks for interactive fashion search, Proceedings of the CVPR (2017)

  • Y. Kalantidis et al., Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos, Proceedings of the ICMR (2013)

  • B. Zhao et al., Clothing cosegmentation for shopping images with cluttered background, IEEE Trans. Multimedia (2016)

  • K. Yamaguchi et al., Parsing clothing in fashion photographs, Proceedings of the CVPR (2012)

  • K. Yamaguchi et al., Paper doll parsing: retrieving similar styles to parse clothing items, Proceedings of the ICCV (2013)

  • S. Liu et al., Fashion parsing with weak color-category labels, IEEE Trans. Multimedia (2014)

  • X. Liang et al., Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval, IEEE Trans. Multimedia (2016)

  • M.H. Kiapour et al., Hipster wars: discovering elements of fashion styles, Proceedings of the ECCV (2014)

  • K. Yamaguchi et al., Chic or social: visual popularity analysis in online fashion networks, Proceedings of the ACM MM (2014)

  • E. Simo-Serra et al., Neuroaesthetics in fashion: modeling the perception of fashionability, Proceedings of the CVPR (2015)

    Guang-Lu Sun received his B.Eng. degree in Computer Science from The People’s Liberation Army Information Engineering University, in 2010, and Ph.D. degree in School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, in 2018. His research interests include multimedia information retrieval, data mining and image/video processing.

    Jun-Yan He is pursuing his Ph.D. degree from School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He received the B.Sc. degree in Software Engineering from Southwest Jiaotong University in 2013. His research interests include multimedia, computer vision and machine learning.

    Xiao Wu received the B.Eng. and M.S. degrees in computer science from Yunnan University, Yunnan, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from the City University of Hong Kong, Hong Kong, in 2008. He is currently a Professor and the Assistant Dean with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He was with the Institute of Software, Chinese Academy of Sciences, Beijing, China, from 2001 to 2002. He was a Research Assistant and a Senior Research Associate with the City University of Hong Kong, Hong Kong, from 2003 to 2004 and from 2007 to 2009, respectively. He was with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, and with the School of Information and Computer Science, University of California at Irvine, Irvine, CA, USA, as a Visiting Scholar, from 2006 to 2007 and 2015 to 2016, respectively. He has authored or co-authored over 70 research papers in well-respected journals, such as TIP, TMM, TMI, and prestigious proceedings like CVPR and ACM MM. His research interests include multimedia information retrieval, image/video computing, and computer vision. He was a recipient of the Second Prize of Natural Science Award of the Ministry of Education, China, in 2016, and the Second Prize of Science and Technology Progress Award of Henan Province, China, in 2017.

    Bo Zhao received his B.Sc. and Ph.D. degrees from School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, in 2010 and 2017, respectively. Currently, he is a Postdoctoral Research Fellow at the University of British Columbia, Vancouver, British Columbia, Canada. He was at the Department of Electrical and Computer Engineering, National University of Singapore, Singapore as a Visiting Scholar from 2015 to 2017. His research interests include multimedia, computer vision and machine learning.

    Qiang Peng received the B.E. degree in automation control from Xi’an Jiaotong University, Xi’an, China, the M.Eng. degree in computer application and technology, and the Ph.D. degree in traffic information and control engineering from Southwest Jiaotong University, Chengdu, China, in 1984, 1987, and 2004, respectively. He is currently a Professor with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He has published over 80 papers and holds ten Chinese patents. His research interests include digital video compression and transmission, image/graphics processing, traffic information detection and simulation, virtual reality technology, and multimedia system and application. He was a recipient of the second prize of the Science and Technology Progress Award, Henan, China, in 2017.
