Multi-depth dilated network for fashion landmark detection with batch-level online hard keypoint mining
Introduction
With the development of electronic commerce platforms such as Amazon, Taobao, and Jingdong, clothing image analysis, including the study of fashion landmark detection [1], [2], clothing attribute prediction [3], [4], clothing retrieval [5], [6], and fashion recommendation [7], [8], has received much attention from researchers. The development and application of these analysis methods can bring huge commercial value to e-commerce companies and lay a foundation for new application fields, such as online custom clothing design and virtual dressing rooms. Additionally, the rapid development of deep neural networks [9], [10] and the emergence of large-scale clothing image datasets [2], [11] make it possible to tackle these challenging tasks more quickly and effectively.
In this paper, a fundamental task and key problem in clothing image analysis, namely fashion landmark detection, is addressed. Extracting features from detected fashion landmarks can significantly improve the performance of other types of clothing image analysis, such as clothing attribute prediction and clothing retrieval. The general aim of fashion landmark detection is to recognize and locate the functional keypoints defined on clothes, such as the corners of necklines, hemlines, and cuffs.
A recent notable work in fashion landmark detection is the attentive fashion grammar network [12], a deep learning model that encodes a set of knowledge over fashion clothes and uses two attention mechanisms to make the network focus on the regions surrounding the fashion landmarks. Although great progress has been made, there remains room for further improvement; for example, it is still difficult to locate hard keypoints that are either invisible or occluded by other objects in the image.
To address this problem, dilated convolution [13] was utilized in the present study to design the novel Multi-Depth Dilated (MDD) block, which extracts multi-depth convolutional features in parallel; this captures multi-level context information, which is beneficial for locating hard keypoints. By stacking MDD blocks, the hard keypoints can be located more accurately. In addition, to further improve the recognition of hard keypoints, the method of Batch-level Online Hard Keypoint Mining (B-OHKM) is also proposed for training. Compared with Online Hard Keypoint Mining (OHKM) [14], which operates within a single image, this method mines hard keypoints across the images of a batch, which better trains the network to detect hard keypoints. Considering these two important factors for addressing the problem of hard keypoint detection, i.e., the feature extraction level and the network training level, the Multi-Depth Dilated Network (MDDNet) is ultimately proposed, and is demonstrated to outperform current state-of-the-art methods on two fashion benchmark datasets. Additionally, it is experimentally established that the proposed MDD block yields non-trivial improvements.
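The benefit of parallel branches of different depths can be illustrated by how stacked dilated convolutions grow the receptive field. The sketch below is illustrative only: the 3×3 kernel, the dilation rate of 2, and the branch depths are assumptions for exposition, not the paper's exact MDD configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 convolutions,
    given one dilation rate per layer."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field
    return rf

# Parallel branches of increasing depth, as in the multi-depth idea:
# deeper branches capture context at larger scales.
for depth in (1, 2, 3):
    print(depth, receptive_field(3, [2] * depth))  # 5, 9, 13
```

Concatenating the outputs of such branches thus exposes the head to several context scales at once, which is what makes the inference of occluded or invisible keypoints easier.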
In summary, the main contributions of this work are three-fold:
1) A novel MDD block is developed that can efficiently extract different levels of large-scale context information, which is beneficial for the inference of hard keypoints.
2) The B-OHKM method is proposed for training to further improve the effectiveness of hard keypoint detection.
3) It is demonstrated that a network (MDDNet) that uses the MDD block and the B-OHKM training method achieves significant improvements over state-of-the-art methods on standard benchmarks for fashion landmark detection.
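The contrast between per-image OHKM and the proposed batch-level variant can be sketched as follows. This is a NumPy sketch under stated assumptions: the selection counts `k` and `m` are hypothetical parameters, and the paper's exact loss formulation may differ.

```python
import numpy as np

def ohkm_loss(losses, k):
    """Per-image OHKM: for each image, average only the k largest
    per-keypoint losses. `losses` has shape (batch, num_keypoints)."""
    topk = np.sort(losses, axis=1)[:, -k:]
    return topk.mean()

def b_ohkm_loss(losses, m):
    """Batch-level OHKM: pool all keypoint losses in the batch and
    average only the m largest, so the hardest keypoints anywhere
    in the batch dominate the gradient."""
    top = np.sort(losses.ravel())[-m:]
    return top.mean()

batch_losses = np.array([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0]])
print(ohkm_loss(batch_losses, 2))    # 4.0 -> (2+3+5+6)/4
print(b_ohkm_loss(batch_losses, 4))  # 4.5 -> (3+4+5+6)/4
```

In the toy batch above, the per-image rule is forced to keep two keypoints from the easy first image, while the batch-level rule spends its whole budget on the genuinely hardest keypoints, which is the motivation for mining at the batch level.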
Related Work
Visual fashion understanding has recently attracted wide attention with the development of e-commerce, and a series of related applied subjects have been studied, including clothing identification [3], [15] and retrieval [5], [16], the prediction of semantic properties [4], [17], and fashion trend discovery [7], [18]. These research projects have all relied on the understanding of the information content of clothing images, which is transformed into the required information on the garments.
The Proposed Approach
In this section, data preprocessing for network input and training is first introduced. The architecture of the deep neural network that includes the structure of the proposed MDD block is then described in detail. Additionally, an explanation of how to stack MDD blocks for building the MDDNet that can effectively tackle the hard keypoints problem and accurately locate the fashion landmarks is provided. Finally, the network learning policy, which includes the training loss function and the
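The snippet above does not show the paper's exact preprocessing. As one common choice in heatmap-based landmark detection, each ground-truth landmark is encoded as a Gaussian peak in a per-keypoint target heatmap; the sketch below assumes this encoding and an illustrative `sigma`.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Target heatmap with a unit Gaussian peak at landmark (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

hm = gaussian_heatmap(64, 64, cx=20, cy=30)
print(hm[30, 20])  # 1.0 at the landmark
print(hm.argmax() // 64, hm.argmax() % 64)  # peak at (row=30, col=20)
```

Under this encoding, the network regresses one heatmap per keypoint and the predicted landmark is read off as the location of the maximum response.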
Experiment
The performance of the proposed method was evaluated on two large-scale fashion benchmark datasets, namely DeepFashion [2] and the Fashion Landmark Dataset (FLD) [1]. An ablation study was also carried out to verify the effectiveness of the proposed MDD block and B-OHKM method.
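Results on these benchmarks are typically reported as normalized error (NE): the mean L2 distance between predicted and ground-truth landmarks, divided by a normalization term such as the image width. A minimal sketch of the metric (the array shapes and values are illustrative):

```python
import numpy as np

def normalized_error(pred, gt, norm):
    """Mean per-landmark L2 distance divided by the normalization
    term (e.g., image width). pred/gt: (num_landmarks, 2) arrays."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist / norm).mean())

pred = np.array([[10.0, 10.0], [50.0, 40.0]])
gt   = np.array([[13.0, 14.0], [50.0, 40.0]])
print(normalized_error(pred, gt, norm=100.0))  # 0.025 -> (5/100 + 0/100) / 2
```

Lower NE is better, and the normalization makes scores comparable across images of different resolutions.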
Conclusion
This paper proposed a Multi-Depth Dilated Network (MDDNet) that can effectively tackle the hard keypoint location problem and hence accurately detect fashion landmarks. The network is built by stacking Multi-Depth Dilated (MDD) blocks, which extract the multi-level, large-scale context information that is necessary for the inference of hard keypoints. In addition, the Batch-level Online Hard Keypoint Mining (B-OHKM) method was proposed for training to further increase the accuracy of hard keypoint detection.
Acknowledgement
This work was supported by the National Key R&D Program of China under grant 2017YFB1002504 and the Shaanxi International Science and Technology Cooperation and Exchange Program of China (2017KW-010).
Conflict of interest
The authors declare that they have no conflict of interest related to this work.
References (28)
- Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang, “Fashion Landmark Detection in the Wild,” in Computer Vision - ECCV 2016...
- Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich...
- H. Chen, A. C. Gallagher, and B. Girod, “Describing Clothing by Semantic Attributes,” in Computer Vision - ECCV 2012 -...
- X. Wang and T. Zhang, “Clothes search in consumer photos via color matching and attribute learning,” in Proceedings of...
- M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg, “Where to Buy It: Matching Street Clothing Photos in...
- J. Huang, R. S. Feris, Q. Chen, and S. Yan, “Cross-Domain Image Retrieval with a Dual Attribute-Aware Ranking Network,”...
- M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg, “Hipster Wars: Discovering Elements of Fashion Styles,” in...
- E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun, “Neuroaesthetics in fashion: Modeling the perception of...
- K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” CoRR, vol....
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on...