Learning fashion compatibility across categories with deep multimodal neural networks
Introduction
With the boom of fashion websites such as Pinterest and Polyvore, fashion studies [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] have attracted increasing attention from researchers. Learning fashion compatibility is an important branch of this topic, since it is useful for recommender systems. Fashion compatibility is essentially a human sense for identifying and understanding the relationships between fashion items, and it exhibits substantial heterogeneity. Generally, it can be measured from several aspects of fashion items, such as low-level visual compatibility (e.g., color and texture) and high-level semantic compatibility (e.g., style and functionality). Learning fashion compatibility therefore has to take multiple factors from heterogeneous data into account, which makes it challenging.
Recently, a few works [25], [26] have learned fashion compatibility by treating it as a metric learning problem. First, a large dataset containing compatible and incompatible items is collected. A parameterized distance function is then defined. Finally, this function is trained so that compatible items end up closer to each other than incompatible ones. Such metric learning methods are flexible and work well to some extent. Unfortunately, they are based solely on fashion images, so they consider only visual compatibility between fashion items. Although visual content is a crucial clue for capturing low-level visual compatibility, it is hard to measure high-level compatibility, such as style compatibility, from images alone; style recognition for fashion items remains an unsolved task in the computer vision community. Moreover, the label between a pair of fashion items in existing works is binary: compatible or incompatible. In practice, however, many pairs fall somewhere between compatible and incompatible. Such hard quantification is therefore insufficient to represent human judgments of fashion compatibility, which should be evaluated at a fine-grained level.
Motivated by these observations, in this paper we model fashion compatibility by mining the multimodal information of fashion items, i.e., fashion images and textual descriptions. As shown in Fig. 1, three pairs of compatible fashion items were crawled from Polyvore. The two items in Fig. 1(a) are obviously compatible, since they share a similar style, including color and texture attributes. However, it is hard to explain why the items in Fig. 1(b) and (c) are compatible judging only from their visual appearance; the explanation can be found in their textual descriptions. For instance, the description "sunproof" for the top dress in Fig. 1(b) indicates its function, which is semantically consistent with the descriptions "vacation" and "beach" for the bottom dress in Fig. 1(b). Visual and textual information are thus inherently complementary to each other, and it is reasonable to measure fashion compatibility by combining them. In addition, we introduce the notion of a "compatible weight" for soft quantification of fashion compatibility: compatibility is rated on a range of values instead of a binary label, which measures the relationship between fashion items more accurately.
Similar to [25], [26], we treat learning fashion compatibility as a metric learning problem. The most challenging issues are how to measure the compatibility between fashion items by combining their visual and textual information, and how to integrate compatible weights into the model. As shown in Fig. 2, an end-to-end deep learning framework, named the Visual-Semantic Fusion Model (VSFM), is proposed to address these issues. It learns a feature transformation from fashion images and texts into a latent feature space in which compatible fashion items lie closer than incompatible ones. Visual and textual representations are produced by a deep Convolutional Neural Network (CNN) and a multilayered Long Short-Term Memory (LSTM) network, respectively, and a linear embedding layer then fuses the two. On top of this model, a novel triplet ranking loss layer with compatible weights is constructed, which captures fine-grained compatibility between fashion items. The trained model can be used offline to compute distances between pairs of fashion items; the only online processing is to choose the most compatible pair according to these distances.
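The fusion step can be sketched with plain numpy. The layer sizes, the L2 normalization, and the use of Euclidean distance below are illustrative assumptions; the paper's exact dimensions and distance measure are not given in this snippet, and the visual/textual vectors stand in for real CNN and LSTM outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (assumptions, not the paper's values).
D_VIS, D_TXT, D_LATENT = 4096, 512, 256

# Linear embedding that fuses the CNN (visual) and LSTM (textual) outputs.
W = rng.standard_normal((D_VIS + D_TXT, D_LATENT)) * 0.01

def embed(visual_feat, text_feat):
    """Map concatenated visual/textual features into the latent space."""
    h = np.concatenate([visual_feat, text_feat]) @ W
    return h / np.linalg.norm(h)  # L2-normalize the latent vector

def distance(item_a, item_b):
    """Euclidean distance between two embedded items."""
    return np.linalg.norm(embed(*item_a) - embed(*item_b))

# Two items, each represented by a (visual, textual) feature pair.
top   = (rng.standard_normal(D_VIS), rng.standard_normal(D_TXT))
skirt = (rng.standard_normal(D_VIS), rng.standard_normal(D_TXT))
print(distance(top, skirt))
```

At inference time, the most compatible candidate is simply the one with the smallest such distance to the query item.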
The contributions of this work are summarized as follows:
- A novel end-to-end deep learning framework is constructed to learn fashion compatibility; it simultaneously integrates two kinds of heterogeneous data, visual and textual information, to form an efficient and effective fashion learning model.
- The notion of compatible weights is proposed to describe fashion compatibility more accurately, and it is integrated into a triplet ranking loss layer to refine the whole deep learning framework.
- To evaluate the performance of the proposed model, extensive experiments are conducted on the Amazon dataset, which demonstrate that the proposed model achieves a substantial improvement over state-of-the-art approaches.
The rest of this paper is organized as follows: Related work is reviewed in Section 2. The proposed Visual-Semantic Fusion Model is elaborated in Section 3. Experiments and results are presented in Section 4. Finally, this paper is concluded with a summary in Section 5.
Related work
The most related branches of our work are fashion research and deep neural networks, which will be discussed in the following subsections.
Problem formulation
Fashion compatibility is determined by the distance between fashion items, which is calculated by the VSFM trained offline, as illustrated in Fig. 2. Given a pair of compatible fashion items Ip and Iq, where Ip ∈ Ci, Iq ∈ Cj, i ≠ j, represented by {Vp, Tp} and {Vq, Tq}, respectively, our goal is to learn a feature transformation f from the visual and textual spaces into a latent hybrid feature space, in which the compatible items Ip and Iq lie close together.
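Training such a transformation typically uses a triplet ranking loss; the paper's exact formulation with compatible weights is not reproduced in this snippet, but a plausible weighted hinge variant scales the standard triplet loss by the pair's compatible weight w:

```python
import numpy as np

def weighted_triplet_loss(anchor, positive, negative, w, margin=0.2):
    """Hinge-style triplet ranking loss scaled by a compatible weight w in [0, 1].

    Pulls a compatible pair (anchor, positive) closer than an incompatible
    pair (anchor, negative) by at least `margin`; w softens the penalty for
    pairs that are only weakly compatible. This is an illustrative sketch,
    not the paper's exact loss.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return w * max(0.0, margin + d_pos - d_neg)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # close in the latent space: compatible
n = np.array([1.0, 0.0])  # far in the latent space: incompatible
print(weighted_triplet_loss(a, p, n, w=1.0))  # 0.0 — ranking already satisfied
```

With binary labels every violated triplet is penalized equally; the weight w lets weakly compatible pairs contribute proportionally less to the gradient.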
Dataset
There are several existing fashion datasets, including the outfit dataset [27], the Amazon clothing dataset [26], the eBay dataset [24], Magic Closet [23] and DeepFashion [9]. The Amazon clothing dataset [26] provides textual descriptions, images of fashion items, and user co-purchasing records, which meets our requirements. Therefore, we select the fashion items containing the fields "title", "imageURL", "category: clothing" and "buy together" from the Amazon clothing dataset. A new dataset with 127,479 fashion
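The field-based filtering described above can be sketched as follows. The record layout is an assumption mirroring the field names mentioned; the real Amazon metadata schema may differ.

```python
# Keep only clothing items that carry every field the model needs.
# The record layout below is illustrative; the real Amazon metadata
# schema may differ.
REQUIRED = ("title", "imageURL", "category", "buy_together")

def is_usable(item):
    """A record is usable if all required fields are present and non-empty,
    and the item belongs to the clothing category."""
    return (all(item.get(k) for k in REQUIRED)
            and "clothing" in item["category"].lower())

records = [
    {"title": "Linen beach dress", "imageURL": "http://example.com/a.jpg",
     "category": "Clothing", "buy_together": ["B00X"]},
    {"title": "Desk lamp", "imageURL": "http://example.com/b.jpg",
     "category": "Home", "buy_together": []},
    {"title": "", "imageURL": "", "category": "Clothing", "buy_together": []},
]

dataset = [r for r in records if is_usable(r)]
print(len(dataset))  # 1
```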
Conclusion
In this paper, a deep multimodal neural network is constructed for learning fashion compatibility. A deep CNN performs the visual embedding, and the semantic embedding is produced by a multilayered Long Short-Term Memory (LSTM) network. The visual and textual information is integrated with a fully connected module. Meanwhile, the notion of compatible weights is introduced and integrated into the triplet ranking loss, which refines the relationships between fashion items. To verify the proposed
Declaration of interest
None.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61772436), the Sichuan Science and Technology Innovation Seedling Fund (2017RZ0015, 2017020), the Foundations for the Department of Transportation of Henan Province (2019J-2-2) and the Fundamental Research Funds for the Central Universities.
References (49)
- et al., Part-based clothing image annotation by visual neighbor retrieval, Neurocomputing (2016)
- et al., Impulsive controller design for exponential synchronization of delayed stochastic memristor-based recurrent neural networks, Neurocomputing (2016)
- et al., Effects of bounded and unbounded leakage time-varying delays in memristor-based recurrent neural networks with different memductance functions, Neurocomputing (2016)
- et al., Impulsive synchronization of Markovian jumping randomly coupled neural networks with partly unknown transition probabilities via multiple integral approach, Neural Netw. (2015)
- Learning characteristics of stochastic-gradient-descent algorithms: a general study, analysis, and critique, Signal Process. (1984)
- et al., Apparel classification with style, Proceedings of the ACCV (2013)
- et al., Describing clothing by semantic attributes, Proceedings of the ECCV (2012)
- et al., Pointwise and pairwise clothing annotation: combining features from social media, Multimedia Tools Appl. (2016)
- et al., Clothes search in consumer photos via color matching and attribute learning, Proceedings of the ACM MM (2011)
- et al., Efficient clothing retrieval with semantic-preserving visual phrases, Proceedings of the ACCV (2012)
- Street-to-shop: cross-scenario clothing retrieval via parts alignment and auxiliary set, Proceedings of the CVPR
- Cross-domain image retrieval with a dual attribute-aware ranking network, Proceedings of the ICCV
- Deepfashion: powering robust clothes recognition and retrieval with rich annotations, Proceedings of the CVPR
- Deep bi-directional cross-triplet embedding for cross-domain clothing retrieval, Proceedings of the ACM MM
- Memory-augmented attribute manipulation networks for interactive fashion search, Proceedings of the CVPR
- Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos, Proceedings of the ICMR
- Clothing cosegmentation for shopping images with cluttered background, IEEE Trans. Multimedia
- Parsing clothing in fashion photographs, Proceedings of the CVPR
- Paper doll parsing: retrieving similar styles to parse clothing items, Proceedings of the ICCV
- Fashion parsing with weak color-category labels, IEEE Trans. Multimedia
- Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval, IEEE Trans. Multimedia
- Hipster wars: discovering elements of fashion styles, Proceedings of the ECCV
- Chic or social: visual popularity analysis in online fashion networks, Proceedings of the ACM MM
- Neuroaesthetics in fashion: modeling the perception of fashionability, Proceedings of the CVPR
Guang-Lu Sun received his B.Eng. degree in Computer Science from The People’s Liberation Army Information Engineering University, in 2010, and Ph.D. degree in School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, in 2018. His research interests include multimedia information retrieval, data mining and image/video processing.
Jun-Yan He is pursuing his Ph.D. degree from School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He received the B.Sc. degree in Software Engineering from Southwest Jiaotong University in 2013. His research interests include multimedia, computer vision and machine learning.
Xiao Wu received the B.Eng. and M.S. degrees in computer science from Yunnan University, Yunnan, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from the City University of Hong Kong, Hong Kong, in 2008. He is currently a Professor and the Assistant Dean with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He was with the Institute of Software, Chinese Academy of Sciences, Beijing, China, from 2001 to 2002. He was a Research Assistant and a Senior Research Associate with the City University of Hong Kong, Hong Kong, from 2003 to 2004 and from 2007 to 2009, respectively. He was with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, and with the School of Information and Computer Science, University of California at Irvine, Irvine, CA, USA, as a Visiting Scholar, from 2006 to 2007 and 2015 to 2016, respectively. He has authored or co-authored over 70 research papers in well-respected journals, such as TIP, TMM, TMI, and prestigious proceedings like CVPR and ACM MM. His research interests include multimedia information retrieval, image/video computing, and computer vision. He was a recipient of the Second Prize of Natural Science Award of the Ministry of Education, China, in 2016, and the Second Prize of Science and Technology Progress Award of Henan Province, China, in 2017.
Bo Zhao received his B.Sc. and Ph.D. degrees from School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China, in 2010 and 2017, respectively. Currently, he is a Postdoctoral Research Fellow at the University of British Columbia, Vancouver, British Columbia, Canada. He was at the Department of Electrical and Computer Engineering, National University of Singapore, Singapore as a Visiting Scholar from 2015 to 2017. His research interests include multimedia, computer vision and machine learning.
Qiang Peng received the B.E. degree in automation control from Xi’an Jiaotong University, Xi’an, China, the M.Eng. degree in computer application and technology, and the Ph.D. degree in traffic information and control engineering from Southwest Jiaotong University, Chengdu, China, in 1984, 1987, and 2004, respectively. He is currently a Professor with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He has published over 80 papers and holds ten Chinese patents. His research interests include digital video compression and transmission, image/graphics processing, traffic information detection and simulation, virtual reality technology, and multimedia system and application. He was a recipient of the second prize of the Science and Technology Progress Award, Henan, China, in 2017.