Multi-label image recognition remains a challenging task in computer vision because of the complexity and diversity of multi-label images. Existing works, however, ignore the co-occurrence correlation between labels and the global contextual information between the image space and objects. We present a model that addresses both problems. On the one hand, we devise a graph attention mechanism to compute the hidden representations of the different categories in multi-label images; it assigns different weights to different neighboring objects and thereby models label dependency well. On the other hand, we use a basic residual network to extract features and iterate the global contextual information with second-order covariance pooling to enhance the nonlinear modeling capability. The proposed model is thoroughly evaluated on the PASCAL VOC 2007 and MS-COCO datasets. Compared with the classical ML-GCN, the model better combines image features and label embeddings. Experiments also show that it outperforms state-of-the-art methods such as the residual multi-layer perceptron, EfficientNet, and the vision transformer.
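The two mechanisms named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names `graph_attention` and `covariance_pooling`, the single attention head, the LeakyReLU slope, and the toy inputs are all assumptions made for illustration. The attention step follows the standard graph attention formulation (a softmax over each node's neighbors), and the pooling step computes only the plain covariance of the feature vectors, leaving out any iterative normalization the paper may apply.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def graph_attention(H, adj, W, a):
    """One single-head graph attention layer over label nodes (illustrative).

    H   : list of node feature vectors, one per label
    adj : adj[i][j] is truthy if label j is a neighbor of label i
          (assumed to include self-loops, so every node has a neighbor)
    W   : shared projection matrix, shape (out_dim x in_dim)
    a   : attention vector of length 2 * out_dim
    Returns the updated node features and the per-node attention weights.
    """
    Z = [matvec(W, h) for h in H]  # project every node
    out, weights = [], []
    for i, zi in enumerate(Z):
        nbrs = [j for j in range(len(Z)) if adj[i][j]]
        # Unnormalized score for each neighbor: LeakyReLU(a . [z_i || z_j])
        scores = [
            leaky_relu(sum(p * q for p, q in zip(a, zi + Z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        weights.append(dict(zip(nbrs, alpha)))
        # Weighted sum of the projected neighbor features
        out.append([
            sum(al * Z[j][k] for al, j in zip(alpha, nbrs))
            for k in range(len(zi))
        ])
    return out, weights


def covariance_pooling(X):
    """Second-order pooling: covariance of N feature vectors of length C.

    X holds one feature vector per spatial position; the result is a
    C x C matrix of channel-to-channel covariances.
    """
    n, c = len(X), len(X[0])
    mean = [sum(x[k] for x in X) / n for k in range(c)]
    centered = [[x[k] - mean[k] for k in range(c)] for x in X]
    return [
        [sum(r[p] * r[q] for r in centered) / n for q in range(c)]
        for p in range(c)
    ]


# Toy usage: three label nodes with hypothetical 2-d embeddings.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]   # co-occurrence graph with self-loops
W = [[1.0, 0.0], [0.0, 1.0]]              # identity projection for clarity
a = [0.5, -0.5, 1.0, 0.25]                # arbitrary attention vector
updated, attn = graph_attention(H, adj, W, a)
cov = covariance_pooling([[1.0, 2.0], [3.0, 6.0]])
```

Each entry of `attn` maps a neighbor index to its weight, and the weights for a node sum to one, which is what lets the layer weight neighboring labels differently. The covariance matrix captures pairwise channel interactions that first-order (average) pooling discards.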
CITATIONS
Cited by 2 scholarly publications.
Performance modeling
Visual process modeling
Chromium
Convolution
Data modeling
Feature extraction
Detection and tracking algorithms