HPViT: A Hybrid Visual Model with Feature Pyramid Transformer Structure


Abstract:

Recently, hybrid designs that fuse Transformers and CNNs have significantly improved model efficiency and accuracy. In this work, we propose a hybrid backbone network, the Hybrid Pyramid Vision Transformer (HPViT), which can be used for dense prediction tasks. Whereas ViT was designed for image classification, HPViT introduces the Transformer structure into a CNN and adopts a pyramid structure, which makes it suitable for dense prediction tasks such as detection and segmentation. Compared with ViT, HPViT has the following advantages: (1) in contrast to the high computational complexity and memory usage of ViT, HPViT can be trained on dense partitions of high-resolution images to capture sufficient detail, converges faster, occupies less memory, and uses the pyramid structure to reduce the computation introduced by the Transformer structure; (2) HPViT combines the advantages of CNNs and Transformers and can be used as a general backbone; (3) experiments show that HPViT performs well in image classification and object detection, with a top-1 accuracy of 81.2% on the ImageNet-1k dataset. In object detection, RetinaNet+HPViT fine-tuned on COCO for 12 epochs reached 34.3% AP, while RetinaNet+ResNet50 reached only 22.9% AP.
Date of Conference: 22-24 December 2023
Date Added to IEEE Xplore: 09 April 2024
Conference Location: Changsha, China
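
The abstract's claim that a pyramid structure reduces the computation of the Transformer part can be illustrated with a rough cost estimate. The sketch below is not the HPViT implementation: the strides (4, 8, 16, 32), the 800x800 input, the channel width of 256, and the helper names token_count and attention_macs are all assumptions chosen for illustration. It only counts the two token-by-token matrix products of one global self-attention layer.

```python
def token_count(height, width, stride):
    """Number of tokens when each token covers a stride x stride pixel region."""
    return (height // stride) * (width // stride)


def attention_macs(tokens, dim):
    """Approximate multiply-accumulates of one global self-attention layer.

    Counts only the two token-by-token products (Q @ K^T and attn @ V),
    each costing tokens * tokens * dim; linear projections are ignored.
    """
    return 2 * tokens * tokens * dim


if __name__ == "__main__":
    height = width = 800  # illustrative high-resolution input (assumption)
    dim = 256             # hypothetical channel width, kept equal at every level
    # Assumed pyramid strides; the abstract does not state HPViT's actual strides.
    for stride in (4, 8, 16, 32):
        n = token_count(height, width, stride)
        print(f"stride {stride:2d}: {n:6d} tokens, {attention_macs(n, dim):,} MACs")
```

Under these assumptions, each halving of the feature-map resolution divides the token count by 4 and the attention cost by 16, so applying attention at coarser pyramid levels is far cheaper than applying it to a single fine-grained token grid. Which stages of HPViT actually use attention is not stated in the abstract.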