
SDViT: Towards Efficient Visual Foundation Model via Unifying Sparse and Dense Representation Learning

Publisher: IEEE

Abstract:

Although window-based self-attention stands out as efficient and effective, a limitation of many previous window-based approaches lies in their reliance on fixed partition patterns within each encoder layer. This constraint restricts the potential for flexible interaction between query and key-value pairs and limits the effective receptive field. To address this issue, we introduce SParse Windows Attention (SPWA), which includes three meta partition patterns: square window, stripes window, and dilation window. In SPWA, each pattern is an independent encoding branch, allowing queries to interact with a broader set of key-value pairs while maintaining linear complexity. Additionally, we introduce Dense Regional Attention (DRA), where each query attends to a set of aggregated regions. By empirically combining sparse and dense encoding schedules, the derived network, SDViT, achieves both coarse- and fine-grained interaction within a single layer, promoting multi-scale learning capability. Experiments on a variety of tasks including ImageNet-1k classification, MS-COCO object detection, and ADE20K semantic segmentation validate the effectiveness of our approach.
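The paper's exact SPWA implementation is not reproduced here, but the idea of the three meta partition patterns can be illustrated with a minimal NumPy sketch: each pattern reshapes the feature map into groups of tokens, and plain softmax self-attention is applied independently within each group, which keeps complexity linear in image size. Function names (`square_windows`, `stripe_windows`, `dilated_windows`) and all shapes below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def square_windows(x, w):
    """Partition an (H, W, C) feature map into non-overlapping w x w
    square windows. Returns (num_windows, w*w, C). Illustrative only."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def stripe_windows(x, w):
    """Horizontal stripes of height w spanning the full width.
    Returns (num_stripes, w*W, C)."""
    H, W, C = x.shape
    return x.reshape(H // w, w * W, C)

def dilated_windows(x, d):
    """Dilated grouping: tokens sampled at stride d across the map form
    one window each, so a window covers the whole image sparsely.
    Returns (d*d, (H//d)*(W//d), C)."""
    H, W, C = x.shape
    x = x.reshape(H // d, d, W // d, d, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(d * d, -1, C)

def window_self_attention(tokens):
    """Plain softmax self-attention within each window (projections
    omitted for brevity), applied independently per window."""
    q = k = v = tokens  # (num_windows, n, C)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v  # same shape as tokens

# Each pattern is an independent branch over the same feature map.
x = np.random.randn(8, 8, 4)
for part in (square_windows(x, 4), stripe_windows(x, 2), dilated_windows(x, 2)):
    out = window_self_attention(part)
    print(part.shape, "->", out.shape)
```

Because attention is computed only within each group, the cost per layer scales with the number of tokens rather than quadratically with image area, while the three branches together give each query access to local, striped, and globally dilated key-value sets.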
Date of Conference: 15-19 July 2024
Date Added to IEEE Xplore: 30 September 2024

Conference Location: Niagara Falls, ON, Canada
