Abstract:
Self-supervised multi-frame depth estimation outperforms single-frame approaches by utilizing not only appearance information but also geometric information. A common practice for multi-frame methods is to employ feature-metric bundle adjustment (FBA) to refine the depth map initialized from the single-frame prior. However, FBA cannot always provide effective residual updates due to unreliable matching costs, which are corrupted by low-texture regions, occlusion, and especially object motion. To tackle this problem, we propose a context-aware transformer (CAT) to refine the corrupted matching costs by leveraging spatial context information. Specifically, the CAT adaptively aggregates matching costs according to the spatial affinity inferred from local appearance context, and produces reliable contextual costs for FBA. Moreover, we design a motion-aware regularization loss to provide supervision for regions with moving objects, making CAT effective in dynamic scenes. Extensive experiments and analyses on the KITTI and Cityscapes datasets demonstrate the effectiveness and superior generalization capability of our approach.
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume 34, Issue 6, June 2024)
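To make the aggregation step described in the abstract concrete, below is a minimal PyTorch sketch of affinity-based cost aggregation: an attention map computed from appearance (context) features reweights per-pixel matching costs, so unreliable costs can borrow evidence from visually similar neighbors. The class name, tensor shapes, and projection layers here are illustrative assumptions, not the paper's actual CAT architecture, which additionally feeds the contextual costs into FBA and is trained with the motion-aware regularization loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareCostAggregation(nn.Module):
    """Hypothetical sketch: refine matching costs with context-derived affinity."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Query/key projections on appearance features define the spatial affinity.
        self.to_q = nn.Linear(feat_dim, feat_dim)
        self.to_k = nn.Linear(feat_dim, feat_dim)
        self.scale = feat_dim ** -0.5

    def forward(self, cost: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # cost:    (B, N, D) matching cost per pixel over D depth hypotheses
        # context: (B, N, C) appearance features for the same N pixels
        q = self.to_q(context)                                             # (B, N, C)
        k = self.to_k(context)                                             # (B, N, C)
        affinity = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)   # (B, N, N)
        # Each pixel's cost becomes an affinity-weighted mix of all costs,
        # letting reliable neighbors repair costs corrupted by occlusion
        # or object motion before they enter bundle adjustment.
        return affinity @ cost                                             # (B, N, D)


# Usage with toy shapes (e.g., a 32x32 cost volume with 32 depth bins):
cost = torch.randn(2, 1024, 32)
context = torch.randn(2, 1024, 64)
refined = ContextAwareCostAggregation(feat_dim=64)(cost, context)
```

One design point this sketch illustrates: because the affinity is inferred purely from appearance context rather than from the (possibly corrupted) costs themselves, it remains informative precisely in the occluded and dynamic regions where raw matching fails.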