
SDViT: Towards Efficient Visual Foundation Model via Unifying Sparse and Dense Representation Learning

Publisher: IEEE

Abstract:

Although window-based self-attention stands out as efficient and effective, a limitation of many previous window-based approaches lies in their reliance on fixed partition patterns within each encoder layer. This constraint restricts the potential for flexible interaction between query and key-value pairs and limits the effective receptive field. To address this issue, we introduce SParse Windows Attention (SPWA), which includes three meta partition patterns: square window, stripes window, and dilation window. In SPWA, each pattern is an independent encoding branch, allowing queries to interact with a broader set of key-value pairs while maintaining linear complexity. Additionally, we introduce Dense Regional Attention (DRA), where each query attends to a set of aggregated regions. By empirically combining sparse and dense encoding schedules, the derived network, SDViT, achieves both coarse- and fine-grained interaction within a single layer, promoting multi-scale learning capability. Experiments on a variety of tasks including ImageNet-1k classification, MS-COCO object detection, and ADE20K semantic segmentation validate the effectiveness of our approach.
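The paper's exact SPWA implementation is not reproduced here, but the idea of the three meta partition patterns can be illustrated with a minimal NumPy sketch: each pattern reshapes the feature map into groups of tokens, and plain softmax self-attention is applied independently within each group, which keeps complexity linear in image size. Function names (`square_windows`, `stripe_windows`, `dilated_windows`) and all shapes below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def square_windows(x, w):
    """Partition an (H, W, C) feature map into non-overlapping w x w
    square windows. Returns (num_windows, w*w, C). Illustrative only."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def stripe_windows(x, w):
    """Horizontal stripes of height w spanning the full width.
    Returns (num_stripes, w*W, C)."""
    H, W, C = x.shape
    return x.reshape(H // w, w * W, C)

def dilated_windows(x, d):
    """Dilated grouping: tokens sampled at stride d across the map form
    one window each, so a window covers the whole image sparsely.
    Returns (d*d, (H//d)*(W//d), C)."""
    H, W, C = x.shape
    x = x.reshape(H // d, d, W // d, d, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(d * d, -1, C)

def window_self_attention(tokens):
    """Plain softmax self-attention within each window (projections
    omitted for brevity), applied independently per window."""
    q = k = v = tokens  # (num_windows, n, C)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v  # same shape as tokens

# Each pattern is an independent branch over the same feature map.
x = np.random.randn(8, 8, 4)
for part in (square_windows(x, 4), stripe_windows(x, 2), dilated_windows(x, 2)):
    out = window_self_attention(part)
    print(part.shape, "->", out.shape)
```

Because attention is computed only within each group, the cost per layer scales with the number of tokens rather than quadratically with image area, while the three branches together give each query access to local, striped, and globally dilated key-value sets.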
Date of Conference: 15-19 July 2024
Date Added to IEEE Xplore: 30 September 2024

Conference Location: Niagara Falls, ON, Canada
