Robust self-supervised monocular visual odometry based on prediction-update pose estimation network

https://doi.org/10.1016/j.engappai.2022.105481

Abstract

Visual odometry aims at estimating the camera pose from video sequences and is an important part of visual Simultaneous Localization and Mapping (SLAM). In this paper, we propose a novel prediction-update pose estimation network, PU-PoseNet, for self-supervised monocular visual odometry. It allows the network to use the effective information of the previous frame when estimating the current pose. A motion-weighted photometric loss based on a long-time pose consistency constraint is designed to make the network pay more attention to the pixels of stationary objects and to enhance the temporal consistency of the estimation results. Depth image-based occlusion detection, a depth smoothness loss and auto-masking are used to construct a depth consistency constraint loss term that reduces the influence of interferences such as occlusion. To further improve the robustness and accuracy of the proposed method, both the depth consistency constraint and a variational auto-encoder are used for network training. In addition, a novel training strategy makes our method adapt to frame-missing cases. Extensive experiments on the KITTI dataset validate the effectiveness of the proposed method.

Introduction

The autonomous driving vehicle has been one of the research hotspots in the field of artificial intelligence. Such a vehicle needs to determine its own position during navigation. Traditional localization methods use the Global Positioning System (GPS) or Real-Time Kinematic (RTK) positioning. Recently, with the rapid development of computer vision technology, more and more researchers have turned to localization methods based on visual odometry (VO), which estimates the camera's position and pose from video frames. As an important part of a visual Simultaneous Localization and Mapping (SLAM) system, VO has been widely used in robot navigation, autonomous driving, augmented reality, etc. Existing VO methods can be divided into two categories: traditional geometry-based methods and learning-based methods. Traditional geometry-based VO methods first extract features from adjacent frames and then match them to estimate the relative pose based on geometric relationships (Campos et al., 2021, Mur-Artal et al., 2015, Mur-Artal and Tardos, 2017). This kind of method has achieved satisfactory results in some practical applications; however, it still suffers from a poor balance between computational cost and robustness. In recent years, deep learning techniques have developed rapidly, and many methods using frontier deep learning approaches, such as federated learning and transfer learning, have been proposed (Zhang et al., 2021a, Zhang and Li, 2022). With the successful application of deep learning in computer vision tasks such as object detection, keypoint detection and feature extraction (DeTone et al., 2018, He et al., 2017, Liu et al., 2019, Ren et al., 2017, Sarlin et al., 2020), learning-based VO methods have attracted more and more researchers' attention.
Compared with traditional VO methods, deep learning-based VO methods can learn prior knowledge from a large number of images, and their performance does not depend on the accuracy of feature detection and image matching. Deep learning-based monocular VO methods are therefore worth further study.

The learning-based VO methods can be divided into supervised and self-supervised methods. Supervised methods use the real poses as the supervision signal to obtain high-precision pose estimation results, but acquiring real poses requires additional expensive equipment, such as lidar or GPS systems. Self-supervised VO methods do not require real poses, so they are more flexible than supervised methods. In recent years, many self-supervised VO methods have been proposed; they mainly use the geometric relationship between depth and camera pose to train the networks. However, their performance still lags somewhat behind that of supervised methods.

Most existing learning-based VO methods do not consider the association between consecutive frames, so the effective information in adjacent frames is not fully used. In addition, moving objects make it difficult to describe all motions in the scene with a single Euclidean transformation: their motions are usually not consistent with the camera trajectory, so it is unreasonable to treat all pixels equally during network training. Furthermore, most existing methods fail when frames are missing. To solve these problems, a robust self-supervised monocular VO method based on a prediction-update pose estimation network is proposed.

To better use the related information of adjacent frames and make the method adapt to frame-missing cases, we introduce the Kalman filter's prediction-update idea into the design of the pose estimation network. The contributions of this paper can be summarized as follows.

(1) An end-to-end self-supervised monocular VO framework is proposed. We design a novel pose estimation network based on a prediction-update mechanism, which uses the information extracted at previous moments to guide the estimation of the current pose and, with a novel training strategy, adapts to frame-missing cases.

(2) We propose a novel loss function for network training to improve robustness and accuracy. The long-time pose consistency constraint-based motion weighted photometric loss term is designed to reduce the impact of moving objects and enhance the temporal consistency of the estimation results. Occlusion detection, depth smoothness and auto-masking are used to construct the depth consistency constraint loss term to reduce the influence of interferences such as occlusion.

(3) Extensive experimental results on the KITTI dataset show that our proposed method achieves state-of-the-art accuracy, and the ablation study validates that the reconstructed image-based motion weighting, the depth consistency constraint and the depth image-based occlusion detection all bring improvements to the proposed VO method.
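Among the loss terms described in contribution (2), the depth smoothness term is a standard component of self-supervised depth/VO pipelines. The sketch below assumes the common edge-aware formulation popularized by Godard et al., in which depth gradients are penalized but down-weighted where the image itself has strong gradients; the paper's exact weighting may differ.

```python
import numpy as np

def smoothness_loss(depth, image):
    """Mean depth-gradient magnitude, down-weighted at image edges."""
    d_dx = np.abs(np.diff(depth, axis=1))
    d_dy = np.abs(np.diff(depth, axis=0))
    # Average color gradients over channels to get per-pixel edge strength.
    i_dx = np.mean(np.abs(np.diff(image, axis=1)), axis=-1)
    i_dy = np.mean(np.abs(np.diff(image, axis=0)), axis=-1)
    # exp(-|dI|) lets depth vary freely across image edges.
    return float(np.mean(d_dx * np.exp(-i_dx)) + np.mean(d_dy * np.exp(-i_dy)))

# A perfectly flat depth map incurs no smoothness penalty.
flat = smoothness_loss(np.ones((4, 4)), np.zeros((4, 4, 3)))
```

The exponential weighting is the key design choice: depth discontinuities are only penalized where the image is textureless, so object boundaries (which coincide with strong image gradients) remain sharp.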

Related works

Traditional geometry-based VO methods can be divided into keypoint-based methods (Campos et al., 2021, Mur-Artal et al., 2015, Mur-Artal and Tardos, 2017) and direct methods (Engel et al., 2018, Engel et al., 2014). Keypoint-based methods first extract features from the image, match them, and then estimate the pose. MonoSLAM (Monocular SLAM) (Davison et al., 2007) is the first real-time monocular visual SLAM system. It uses the extended Kalman filter method to track sparse keypoints on

Prediction-update pose estimation network

The Kalman filter is an algorithm that uses a linear system state equation to optimally estimate the system state from the system's inputs and output observations. It has been successfully used in traditional geometry-based VO and SLAM systems. Inspired by the Kalman filter's idea of "prediction-update", we design a novel prediction-update pose estimation network (PU-PoseNet) to obtain the relative pose between the current frame I_t and the previous frame I_{t-1}. As shown in Fig. 1, the
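To make the prediction-update cycle concrete, a minimal one-dimensional Kalman filter can be sketched as follows. This is a toy illustration only: in PU-PoseNet the linear equations below are replaced by learned network modules, and the state is a camera pose rather than a scalar.

```python
def kalman_step(x, p, z, q=1e-3, r=1e-1):
    """One prediction-update cycle for a constant-state model.

    x, p : previous state estimate and its variance
    z    : new measurement (playing the role of the current frame)
    q, r : process and measurement noise variances
    """
    # Prediction: propagate the previous estimate forward in time.
    x_pred = x       # constant-state motion model
    p_pred = p + q   # uncertainty grows during prediction

    # Update: correct the prediction with the new observation.
    k = p_pred / (p_pred + r)          # Kalman gain
    x_new = x_pred + k * (z - x_pred)  # blend prediction and measurement
    p_new = (1.0 - k) * p_pred         # uncertainty shrinks after update
    return x_new, p_new

# Filtering a noisy constant signal converges toward the true value (1.0).
x, p = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 1.0, 0.95]:
    x, p = kalman_step(x, p, z)
```

The prediction step is what carries information forward between frames; this is precisely the property the paper exploits, since a prediction remains available even when a measurement (frame) is missing.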

Self-supervised monocular visual odometry

In this section, we present a novel self-supervised monocular VO method based on our proposed PU-PoseNet. At first, the overall framework of the proposed method is presented. Then, the detailed construction process of each loss function term is described in the following subsections.

Inspired by the work of SfMLearner (Zhou et al., 2017), we design a novel self-supervised monocular VO method using our proposed PU-PoseNet. Its main idea is to first learn two networks on the estimation of monocular
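The view-synthesis principle behind such self-supervised training can be illustrated for a single pixel: a target pixel is back-projected using its predicted depth, transformed by the predicted relative pose, and re-projected into the source frame; the photometric difference between the two locations then supervises both networks. The sketch below shows this geometry only; the intrinsics matrix `K` and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def warp_pixel(u, v, depth, K, T):
    """Project target pixel (u, v) with predicted depth into the source view.

    K : 3x3 camera intrinsics
    T : 4x4 relative pose (target -> source)
    """
    # Back-project to a 3-D point in the target camera frame.
    p_cam = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Move the point into the source camera frame.
    p_src = (T @ np.append(p_cam, 1.0))[:3]
    # Re-project with the intrinsics and dehomogenize.
    uvw = K @ p_src
    return uvw[:2] / uvw[2]

# Example intrinsics; with an identity pose every pixel maps onto itself.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
same = warp_pixel(100.0, 80.0, 10.0, K, np.eye(4))
```

In training, this warp is applied densely so the source frame can be resampled into the target view, and the photometric loss compares the resampled image with the real target frame; errors in either depth or pose break the alignment and increase the loss.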

Experiments and evaluations

In this paper, we use the KITTI odometry dataset (Geiger et al., 2012) to evaluate our method. The KITTI odometry dataset includes binocular images, lidar points and ground truth for urban and highway environments, supporting tasks of stereo, optical flow, visual odometry, 3D object detection and 3D tracking. It contains 11 video sequences (00–10) with ground-truth trajectories. In our experiments, we use the same dataset splitting strategy as Zhou et al. (2017). Sequences 00–08 are used for training

Conclusions

In this paper, we propose an end-to-end self-supervised monocular VO method that applies the Kalman filter's prediction-update idea to pose estimation. Firstly, we design a novel prediction-update pose estimation network based on this idea, which makes effective use of the information between adjacent frames. Secondly, the long-time pose consistency constraint-based motion weighted photometric loss, the depth smoothness loss, the depth consistency

CRediT authorship contribution statement

Haixin Xiu: Methodology, Software, Writing – original draft, Writing – review & editing. Yiyou Liang: Investigation, Validation. Hui Zeng: Project administration, Supervision, Writing – review & editing. Qing Li: Conceptualization, Formal analysis. Hongmin Liu: Writing – review & editing. Bin Fan: Writing – review. Chen Li: Conceptualization, Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

This work was supported by the National Key R&D Program of China (2020YFB1313002), the National Natural Science Foundation of China (Grant Nos. 61973029, 62076026, 62033010), the Scientific and Technological Innovation Foundation of Foshan, China (BK21BF004), and a Research Project of the Beijing Young Topnotch Talents Cultivation Program, China (Grant No. CIT&TCD201904009).

References (47)

  • Bian, J.-W., et al. Unsupervised scale-consistent depth and ego-motion learning from monocular video.
  • Campos, C., et al. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robot. (2021).
  • Charles, R.Q., et al. PointNet: Deep learning on point sets for 3D classification and segmentation.
  • Davison, A.J., et al. MonoSLAM: Real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. (2007).
  • DeTone, D., et al. SuperPoint: Self-supervised interest point detection and description.
  • Engel, J., et al. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. (2018).
  • Engel, J., et al. LSD-SLAM: Large-scale direct monocular SLAM.
  • Engel, J., et al. Semi-dense visual odometry for a monocular camera.
  • Feng, T., et al. SGANVO: Unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks. IEEE Robot. Autom. Lett. (2019).
  • Geiger, A., et al. Are we ready for autonomous driving? The KITTI vision benchmark suite.
  • Godard, C., et al. Unsupervised monocular depth estimation with left-right consistency.
  • Godard, C., et al. Digging into self-supervised monocular depth estimation.
  • Gordon, A., et al. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras.
  • He, K., et al. Mask R-CNN.
  • Huang, Y., et al. Learning optical flow with R-CNN for visual odometry.
  • Jau, Y.-Y., et al. Deep keypoint-based camera pose estimation with geometric constraints.
  • Kingma, D.P., Welling, M., 2014. Auto-encoding variational bayes. In: International Conference on Learning...
  • Klein, G., et al. Parallel tracking and mapping for small AR workspaces.
  • Li, Y., et al. Pose graph optimization for unsupervised monocular visual odometry.
  • Li, R., et al. UnDeepVO: Monocular visual odometry through unsupervised deep learning.
  • Liang, Z., et al. Deep unsupervised learning based visual odometry with multi-scale matching and latent feature constraint.
  • Liu, Y., et al. GIFT: Learning transformation-invariant dense visual descriptors via group CNNs.
  • Mahjourian, R., et al. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints.
