
A Transformer-Based Architecture for High-Resolution Stereo Matching


Abstract:

The Transformer architecture is now widely used thanks to its strong parallel-computing and global-modelling capabilities. In this paper, we build a dense Feature Extraction Transformer (FET) for stereo matching that combines Transformer and convolution blocks. FET offers three advantages for stereo matching: 1) for high-resolution stereo image pairs, Transformer blocks combined with spatial-pyramid-pooling windows capture a wide range of contextual representations while maintaining linear computational complexity; 2) convolution and transposed-convolution blocks respectively implement overlapping patch embedding, which allows features to capture enough neighbourhood information to facilitate fine-grained matching; 3) FET creatively uses a jump-query strategy to apply the Transformer encoder and decoder structures to feature extraction simultaneously. Furthermore, to obtain an architecture based more thoroughly on the Transformer, we adopt STTR's (Li et al., 2021) attention-based pixel-matching strategy. Our model achieves a 0.32 end-point error and a 0.89% 3-px error on the Scene Flow benchmark (absolute improvements of 30.95 and 29.36 points over STTR). On the KITTI 2015 benchmark, our model achieves a 1.80 D1-bg on estimated pixels (a 1.57-point error reduction compared to STTR).
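Overlapping patch embedding of the kind described in point 2 is commonly realized with a strided convolution whose kernel is larger than its stride, so that adjacent patches share pixels. A minimal sketch of the patch-count arithmetic follows; the kernel, stride, and padding values are illustrative assumptions, not taken from the paper:

```python
def num_patches(size, kernel, stride, padding):
    """Number of patches along one spatial dimension produced by a
    strided convolution (standard conv output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1

h = 32  # spatial size of one feature-map dimension

# Non-overlapping embedding: kernel == stride (ViT-style 4x4 patches).
print(num_patches(h, kernel=4, stride=4, padding=0))  # 8 patches per side

# Overlapping embedding: kernel > stride, so neighbouring patches share
# pixels and each token sees local context beyond its own patch, while
# the token count stays the same.
print(num_patches(h, kernel=7, stride=4, padding=3))  # still 8 per side
```

Because the stride is unchanged, the overlapping variant keeps the same token grid while enlarging each token's receptive field, which is what supports the fine-grained matching claimed above.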
Published in: IEEE Transactions on Computational Imaging ( Volume: 10)
Page(s): 83 - 92
Date of Publication: 10 January 2024

