The complex ornamentation and distinctive composition of ethnic minority costumes limit the performance of current costume image recognition algorithms. Models based on convolutional neural networks can extract deep semantic features from clothing images and perform well on larger datasets, but they ignore the large-scale features of images along each spatial dimension. We therefore propose an improved model based on Vision Transformer: asymmetric convolution extracts image features along the height and width directions, the resulting features are passed to the Transformer encoder for serialization and encoding, and the encoder output is used to produce the recognition result. With accuracy as the evaluation metric on a minority clothing dataset, the proposed method outperforms ResNet34 and is 1.2 percentage points more accurate than the classic Vision Transformer.
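As a rough illustration of the pipeline described above, the following PyTorch sketch chains a height-direction convolution and a width-direction convolution into a token sequence and passes it through a standard Transformer encoder. It is not the authors' implementation: the class names `AsymmetricPatchEmbedding` and `TinyAsymmetricViT`, the sequential ordering of the two convolutions, the kernel sizes, the embedding width, the encoder depth, and the three output classes (suggested only by the three recall columns in Table 4) are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class AsymmetricPatchEmbedding(nn.Module):
    """Hypothetical embedding layer: a (p x 1) convolution along the height,
    then a (1 x p) convolution along the width, flattened into tokens."""

    def __init__(self, in_channels=3, embed_dim=192, patch_size=16):
        super().__init__()
        # Height direction only: kernel (p, 1), stride (p, 1).
        self.conv_h = nn.Conv2d(in_channels, embed_dim,
                                kernel_size=(patch_size, 1),
                                stride=(patch_size, 1))
        # Width direction only: kernel (1, p), stride (1, p).
        self.conv_w = nn.Conv2d(embed_dim, embed_dim,
                                kernel_size=(1, patch_size),
                                stride=(1, patch_size))

    def forward(self, x):
        x = self.conv_w(self.conv_h(x))      # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence


class TinyAsymmetricViT(nn.Module):
    """Assumed minimal classifier: embedding -> encoder -> linear head."""

    def __init__(self, num_classes=3, embed_dim=192, depth=2, num_heads=3,
                 image_size=224, patch_size=16):
        super().__init__()
        self.embed = AsymmetricPatchEmbedding(3, embed_dim, patch_size)
        num_tokens = (image_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.embed(x)
        # Prepend the class token, then add the position embedding.
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])      # classify from the class token


model = TinyAsymmetricViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 3])
```

Splitting one square patch kernel into a column kernel and a row kernel keeps the token grid the same size as in a standard Vision Transformer while letting each convolution aggregate features along a single direction.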
Table 1. Symbol definition
Symbol | Definition |
× | Multiplication of Vectors or Matrices |
⊕ | Concatenation of Two Vectors |
+ | Element-wise Addition of Two Matrices or Vectors |
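The three operations in Table 1 can be illustrated on small values. The helper names `matmul`, `concat`, and `add` below are hypothetical and chosen only for this sketch; nested Python lists stand in for matrices and vectors.

```python
def matmul(a, b):
    """The x symbol: matrix product of a (n x m) and b (m x k)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def concat(u, v):
    """The concatenation symbol: join two vectors end to end."""
    return u + v


def add(a, b):
    """The + symbol: add corresponding elements of two matrices."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(matmul(A, B))         # [[2, 1], [4, 3]]
print(concat([1, 2], [3]))  # [1, 2, 3]
print(add(A, B))            # [[1, 3], [4, 4]]
```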
Table 2. Software and hardware environment used in the experiment
CPU | Intel Core i7-12700KF |
Host Memory | 32GB |
GPU | NVIDIA GeForce RTX 3090 |
GPU Memory | 24GB |
Operating System | Windows 11 |
Programming Language | Python |
Deep Learning Framework | PyTorch |
Dependency Library | CUDA 11.3 |
Table 3. Definitions of TP and FN
Prediction | Samples Belonging to the Current Class |
Predicted as the Current Class | TP |
Predicted as Another Class | FN |
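TP and FN as defined in Table 3 are the two counts needed for the standard per-class recall, TP / (TP + FN). The sketch below computes it from label lists; the function name `recall_per_class` and the toy labels are made up for illustration and are not taken from the experiment.

```python
def recall_per_class(y_true, y_pred):
    """Return {class: TP / (TP + FN)} for every class present in y_true."""
    result = {}
    for c in sorted(set(y_true)):
        # TP: samples of class c that were predicted as class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # FN: samples of class c that were predicted as some other class.
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        result[c] = tp / (tp + fn)
    return result


# Toy labels, invented for this example only.
y_true = ["Hani", "Hani", "Wa", "Wa", "Wa", "Yi", "Yi", "Yi", "Yi"]
y_pred = ["Hani", "Hani", "Wa", "Wa", "Hani", "Yi", "Yi", "Yi", "Wa"]

print(recall_per_class(y_true, y_pred))
```

In this toy run both Hani samples are recovered, one of three Wa samples is missed, and one of four Yi samples is missed, so the recalls are 1.0, 2/3, and 0.75 respectively.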
Table 4. Results on the Test Set
Model | Accuracy | Recall (Hani) | Recall (Wa) | Recall (Yi) | AUC |
ViT base | 98.6% | 99.12% | 99.65% | 90.24% | 0.9863 |
ViT Improvement | 99.5% | 100.00% | 99.65% | 97.56% | 0.9994 |
ViT Improvement+mask | 99.8% | 100.00% | 100.00% | 97.56% | 0.9997 |
Inception v3 | 99.1% | 98.23% | 99.31% | 100.00% | 0.9993 |
ResNet34 | 99.3% | 99.12% | 99.31% | 100.00% | 0.9965 |
DenseNet121 | 99.5% | 100.00% | 99.65% | 97.56% | 0.9981 |
Vision Transformer
Improved embedding layer, take convolution kernel
Improved Transformer encoder
Improved model based on Vision Transformer
Accuracy changes on the training set