Depth-Aware Multi-Modal Fusion for Generalized Zero-Shot Learning


Abstract:

Realizing Generalized Zero-Shot Learning (GZSL) with large models is becoming a prevailing trend. However, most existing methods treat large models as black boxes, using only the features output by the final layer and ignoring the potential performance gains from other layers. Indeed, numerous researchers have visualized how the features learned by a neural network vary across its layers. Motivated by this observation, we propose a Vision Transformer (ViT)-based GZSL method named Depth-Aware Multi-Modal ViT (DAM2ViT), which exploits the multi-level features of ViT. DAM2ViT incorporates a multi-modal interaction block that aligns the semantic information of categories across multiple layers, thereby strengthening the model's capacity to learn associations between the visual and semantic spaces. Extensive experiments on three benchmark datasets (CUB, SUN, and AWA2) show that DAM2ViT achieves competitive results compared with state-of-the-art methods.
Date of Conference: 18-20 August 2024
Date Added to IEEE Xplore: 12 December 2024
Conference Location: Beijing, China
