Abstract
Fashion style is expressed through the way clothing and accessories are combined, as well as through the silhouettes, textiles, colors, and shape details of each fashion item. Style classification is challenging because a single style shows wide visual variation and because distinct styles can look alike. Fashion experts categorize styles not only by global appearance but also by the attributes of individual items and their combinations. We propose an item-region-based fashion style classification network (IRSN) that classifies fashion styles by analyzing item-level features and their combinations. IRSN extracts item features using item region pooling (IRP), analyzes them separately, and aggregates them using gated feature fusion (GFF). In addition, IRSN adopts a dual-backbone architecture that combines a domain-specific feature extractor with a general feature extractor pretrained on a large general image-text dataset. In our experiments, we evaluated IRSN variants built on six widely used backbones, including EfficientNet, ConvNeXt, and SwinTransformer. The IRSN models outperformed their baseline models by an average of 8.9% and a maximum of 16.7% on the FashionStyle14 dataset, and by an average of 9.4% and a maximum of 17.0% on the ShowniqV3 dataset. Visualization results indicate that the IRSN models capture differences between similar style classes more effectively than the baseline models.
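The abstract names IRP and GFF but does not give their exact formulation, so the following is only a minimal, framework-free sketch of the general idea under stated assumptions: item region pooling is taken to be masked average pooling over a feature map, and gated feature fusion is taken to be a sigmoid-gated weighted sum of item features. The function names, the binary item masks, and the gate parameters (`gate_weights`, `gate_bias`) are hypothetical and are not taken from the paper; the dual-backbone part is not covered.

```python
import math


def item_region_pool(feature_map, mask):
    """Average the feature vectors at the locations covered by one item mask.

    feature_map: H x W grid of C-dimensional feature vectors (nested lists).
    mask: H x W grid of 0/1 flags marking one item's region (assumed input).
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # item not present: fall back to a zero vector
    return [t / count for t in total]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(item_features, gate_weights, gate_bias):
    """Scale each item feature by a scalar gate in (0, 1), then sum them."""
    channels = len(item_features[0])
    fused = [0.0] * channels
    gates = []
    for feat in item_features:
        # Assumed gate: a linear score of the item feature passed through a sigmoid.
        score = sum(w * f for w, f in zip(gate_weights, feat)) + gate_bias
        gate = sigmoid(score)
        gates.append(gate)
        for c in range(channels):
            fused[c] += gate * feat[c]
    return fused, gates


# Toy 2 x 2 feature map with 2 channels and two hypothetical item regions.
feature_map = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
top_mask = [[1, 1], [0, 0]]
bottom_mask = [[0, 0], [1, 1]]

top = item_region_pool(feature_map, top_mask)        # [2.0, 0.0]
bottom = item_region_pool(feature_map, bottom_mask)  # [0.0, 3.0]

# With zero gate parameters every gate equals sigmoid(0) = 0.5.
fused, gates = gated_fusion([top, bottom], gate_weights=[0.0, 0.0], gate_bias=0.0)
print(fused, gates)  # [1.0, 1.5] [0.5, 0.5]
```

In a trained model the gate parameters would be learned, so that items more indicative of a style receive gates closer to 1; here they are fixed at zero only to keep the arithmetic easy to check by hand.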











Data availability and access
FashionStyle14 [4] is available from https://esslab.jp/~ess/en/data/fashionstyle14/. ShowniqV3 is not publicly available because it was collected for a commercial service.
References
Lee MG, Kim HJ (2021) Analysis of the sales promotion strategy of online fashion shopping mall. Korea Inst Cult Prod Des 64:227–240
Kennedy A, Stoehrer EB, Calderin J (2013) Fashion Design, Referenced: A Visual Guide to the History, Language, and Practice of Fashion. Rockport Publishers, Beverly, Mass
Sorger R, Udale J (2006) The Fundamentals of Fashion Design. AVA Publishing, Worthing, West Sussex, United Kingdom
Takagi M, Simo-Serra E, Iizuka S, Ishikawa H (2017) What Makes a Style: Experimental Analysis of Fashion Prediction. In: Proceedings of the international conference on computer vision workshops (ICCVW). https://doi.org/10.1109/ICCVW.2017.263
Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4510–4520
Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning, pp 6105–6114. PMLR
Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X et al (2020) Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell 43(10):3349–3364
Liu Z, Mao H, Wu C-Y, Feichtenhofer C, Darrell T, Xie S (2022) A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11976–11986
Woo S, Debnath S, Hu R, Chen X, Liu Z, Kweon IS, Xie S (2023) Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16133–16142
Wang W, Dai J, Chen Z, Huang Z, Li Z, Zhu X, Hu X, Lu T, Lu L, Li H et al (2023) Internimage: Exploring large-scale vision foundation models with deformable convolutions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14408–14419
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Advances in neural information processing systems 30
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. In: International conference on learning representations
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, Ning J, Cao Y, Zhang Z, Dong L et al (2022) Swin transformer v2: Scaling up capacity and resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12009–12019
Mishra S, Liang P, Czajka A, Chen DZ, Hu XS (2019) Cc-net: Image complexity guided network compression for biomedical image segmentation. In: 2019 IEEE 16th International symposium on biomedical imaging (ISBI 2019), pp 57–60. IEEE
Sun G-L, Wu X, Chen H-H, Peng Q (2015) Clothing style recognition using fashion attribute detection. In: Proceedings of the 8th international conference on mobile multimedia communications. MobiMedia ’15, pp 145–148. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, BEL
Liu Z, Luo P, Qiu S, Wang X, Tang X (2016) Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR. arXiv:1704.04861
Huang Z, Wang X, Huang L, Huang C, Wei Y, Liu W (2019) Ccnet: Criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 603–612
Wan Q, Huang Z, Lu J, Gang Y, Zhang L (2022) Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In: The eleventh international conference on learning representations
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2021) Training data-efficient image transformers & distillation through attention. In: International conference on machine learning, pp 10347–10357. PMLR
Dai Z, Liu H, Le QV, Tan M (2021) Coatnet: Marrying convolution and attention for all data sizes. Adv Neural Inf Process Syst 34:3965–3977
Park N, Kim S (2021) How do vision transformers work? In: International conference on learning representations
Kim S, Choi Y, Park J (2021) Recognition of multi label fashion styles based on transfer learning and graph convolution network. J Soc e-Bus Stud 26(1):29–41. https://doi.org/10.7838/jsebs.2021.26.1.029
Chen X, Deng Y, Di C, Li H, Tang G, Cai H (2022) High-accuracy clothing and style classification via multi-feature fusion. Appl Sci 12(19):10062. https://doi.org/10.3390/app121910062
Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Hendrycks D, Lee K, Mazeika M (2019) Using pre-training can improve model robustness and uncertainty. In: International conference on machine learning, pp 2712–2721. PMLR
He K, Girshick R, Dollár P (2019) Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4918–4927
Ke A, Ellsworth W, Banerjee O, Ng AY, Rajpurkar P (2021) Chextransfer: performance and parameter efficiency of imagenet models for chest x-ray interpretation. In: Proceedings of the conference on health, inference, and learning, pp 116–124
Marmanis D, Datcu M, Esch T, Stilla U (2015) Deep learning earth observation classification using imagenet pretrained networks. IEEE Geosci Remote Sens Lett 13(1):105–109
Li A, Jabri A, Joulin A, Van Der Maaten L (2017) Learning visual n-grams from web data. In: Proceedings of the IEEE international conference on computer vision, pp 4183–4192
Joulin A, Van Der Maaten L, Jabri A, Vasilache N (2016) Learning visual features from large weakly supervised data. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pp 67–84. Springer
Zhang Y, Jiang H, Miura Y, Manning CD, Langlotz CP (2022) Contrastive learning of medical visual representations from paired images and text. In: Machine learning for healthcare conference, pp 2–25. PMLR
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, pp 8748–8763. PMLR
Lüddecke T, Ecker A (2022) Image segmentation using text and image prompts. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 7086–7096
Kumari T, Syal P, Aggarwal AK, Guleria V (2020) Hybrid image registration methods: a review. Int J Adv Trends Comput Sci Eng 9:1134–1142
Maini D, Aggarwal AK (2018) Camera position estimation using 2d image dataset. Int J Innov Eng Technol 10:199–203
Arora K, Kumar A (2017) A comparative study on content based image retrieval methods. Int J Latest Technol Eng Manag Appl Sci 6(4):77–80
Arora K, Aggarwal AK (2017) Approaches for image database retrieval based on color, texture, and shape features. Handbook of research on advanced concepts in real-time image and video processing, 28
Aggarwal AK (2022) Learning texture features from glcm for classification of brain tumor mri images using random forest classifier. Trans Signal Process 18:60–63
Kumari T, Guleria V, Syal P, Aggarwal AK (2021) A feature cum intensity based ssim optimised hybrid image registration technique. In: 2021 International conference on computing, communication and green engineering (CCGE), pp 1–8. IEEE
PyTorch documentation: torch.nn.AdaptiveAvgPool2d. https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
Liu X, Zhu X, Li M, Wang L, Zhu E, Liu T, Kloft M, Shen D, Yin J, Gao W (2019) Multiple kernel k-means with incomplete kernels. IEEE Trans Pattern Anal Mach Intell 42(5):1191–1204
Zhou Z, Zhang B, Yu X (2022) Immune coordination deep network for hand heat trace extraction. Infrared Phys Technol 127:104400
Yu X, Ye X, Zhang S (2022) Floating pollutant image target extraction algorithm based on immune extremum region. Digit Signal Process 123:103442
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp 248–255. https://doi.org/10.1109/CVPR.2009.5206848
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
Acknowledgements
This research was supported by Deep Fashion Co., Ltd and the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation) in 2023 (2023-0-00055).
Author information
Authors and Affiliations
Contributions
All authors contributed to the conception and design of the study. Jinyoung Choi mainly developed the model and performed the experiments together with Youngchae Kwon under the supervision of Injung Kim. The manuscript was drafted, revised, and approved by all authors.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Choi, J., Kwon, Y. & Kim, I. Item-region-based style classification network (IRSN): a fashion style classifier based on domain knowledge of fashion experts. Appl Intell 54, 9579–9593 (2024). https://doi.org/10.1007/s10489-024-05683-9
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-024-05683-9