Research on image recognition of ethnic minority clothing based on improved vision transformer

  • * Corresponding author: Bin Wen

  • Because ethnic minority costumes have complex ornamentation and distinctive composition, the performance of current costume image recognition algorithms on them is limited. Models based on convolutional neural networks can extract deep semantic features from clothing images and perform better on datasets with more images, but they ignore large-scale image features along each spatial dimension. We therefore propose an improved model based on Vision Transformer. It extracts image features along the height and width directions with asymmetric convolution, feeds them into the Transformer encoder for serialization and encoding, and uses the encoder output to obtain the recognition result. With accuracy as the evaluation metric on the minority clothing dataset, the proposed method outperforms ResNet34 and is 1.2 percentage points more accurate than the classic Vision Transformer.

    Mathematics Subject Classification: Primary: 68T07, 68T45.


  • Figure 1.  Vision Transformer

    Figure 2.  Improved embedding layer, taking a 1×S convolution kernel as an example

    Figure 3.  Improved Transformer encoder

    Figure 4.  Improved model based on Vision Transformer

    Figure 5.  Accuracy changes on the training set

    Table 1.  Symbol definition

    Symbol    Definition
    ×         Multiplication of vectors or matrices
              Concatenation of two vectors
    +         Element-wise addition of two matrices or vectors
    Table 2.  Software and hardware environment used in the experiment

    CPU                        Intel Core i7-12700KF
    Host memory                32 GB
    GPU                        NVIDIA GeForce RTX 3090
    GPU memory                 24 GB
    Operating system           Windows 11
    Programming language       Python
    Deep learning framework    PyTorch
    Dependency library         CUDA 11.3
    Table 3.  Definitions of TP and FN

    Prediction                        Samples that belong to the current class
    Predicted as the current class    TP
    Predicted as another class        FN
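
The TP and FN counts defined above are the inputs to per-class recall. The following is a minimal sketch, assuming the standard definitions recall = TP / (TP + FN) and accuracy = correct predictions / all predictions; the function names and the toy labels are hypothetical and are not taken from the paper's code or data. Only the class names (Hani, Wa, Yi) come from the results table.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {class: TP / (TP + FN)} for every class that occurs in y_true."""
    tp = Counter()  # samples of a class that were predicted as that class
    fn = Counter()  # samples of a class that were predicted as another class
    for truth, pred in zip(y_true, y_pred):
        if pred == truth:
            tp[truth] += 1
        else:
            fn[truth] += 1
    return {c: tp[c] / (tp[c] + fn[c]) for c in sorted(set(y_true))}


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels for illustration only (not data from the paper).
y_true = ["Hani", "Hani", "Wa", "Wa", "Wa", "Yi"]
y_pred = ["Hani", "Wa", "Wa", "Wa", "Wa", "Yi"]

recalls = per_class_recall(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

In this toy example one of the two Hani samples is predicted as Wa, so Hani has TP = 1 and FN = 1, while Wa and Yi have no false negatives.
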
    Table 4.  Results on the Test Set

    Neural network          Accuracy    Recall (Hani)    Recall (Wa)    Recall (Yi)    AUC
    ViT base                98.6%       99.12%           99.65%         90.24%         0.9863
    ViT Improvement         99.5%       100.00%          99.65%         97.56%         0.9994
    ViT Improvement+mask    99.8%       100.00%          100.00%        97.56%         0.9997
    Inception v3            99.1%       98.23%           99.31%         100.00%        0.9993
    ResNet34                99.3%       99.12%           99.31%         100.00%        0.9965
    DenseNet121             99.5%       100.00%          99.65%         97.56%         0.9981
