Robust self-supervised monocular visual odometry based on prediction-update pose estimation network

https://doi.org/10.1016/j.engappai.2022.105481

Abstract

Visual odometry aims at estimating the camera pose from video sequences and is an important part of visual Simultaneous Localization and Mapping (SLAM). In this paper, we propose a novel prediction-update pose estimation network, PU-PoseNet, for self-supervised monocular visual odometry. It allows the network to use the effective information of the previous frame when estimating the current pose. A motion-weighted photometric loss based on a long-time pose consistency constraint is designed to make the network pay more attention to the pixels of stationary objects and to enhance the temporal consistency of the estimation results. Depth image-based occlusion detection, a depth smoothness loss and auto-masking are used to construct a depth consistency constraint loss term that reduces the influence of interferences such as occlusion. To further improve the robustness and accuracy of the proposed method, both the depth consistency constraint and a variational auto-encoder are used for network training. In addition, a novel training strategy makes our method adapt to frame-missing cases. Extensive experiments on the KITTI dataset validate the effectiveness of the proposed method.

Introduction

The autonomous driving vehicle has been one of the research hotspots in the field of artificial intelligence. Such a vehicle needs to determine its own position during navigation. Traditional localization methods use the Global Positioning System (GPS) or Real-Time Kinematic (RTK) positioning. Recently, with the rapid development of computer vision technology, more and more researchers have turned to localization methods based on visual odometry (VO), which estimates the camera's position and pose from video frames. As an important part of a visual Simultaneous Localization and Mapping (SLAM) system, VO has been widely used in robot navigation, autonomous driving, augmented reality, etc. Existing VO methods can be divided into two categories: traditional geometry-based methods and learning-based methods. Traditional geometry-based VO methods first extract features from adjacent frames and then match them to estimate the relative pose based on geometric relationships (Campos et al., 2021, Mur-Artal et al., 2015, Mur-Artal and Tardos, 2017). This kind of method has achieved satisfactory results in some practical applications; however, it still suffers from a poor balance between computational cost and robustness. In recent years, deep learning techniques have developed rapidly, and many methods using frontier deep learning approaches, such as federated learning and transfer learning, have been proposed (Zhang et al., 2021a, Zhang and Li, 2022). With the successful application of deep learning in computer vision tasks such as object detection, keypoint detection and feature extraction (DeTone et al., 2018, He et al., 2017, Liu et al., 2019, Ren et al., 2017, Sarlin et al., 2020), learning-based VO methods have attracted more and more researchers' attention.
Compared with traditional VO methods, deep learning-based VO methods can learn prior knowledge from a large number of images, and their performance does not depend on the accuracy of feature detection and image matching. Deep learning-based monocular VO methods are therefore worth further study.

The learning-based VO methods can be divided into supervised and self-supervised methods. Supervised methods use the real poses as the supervision signal to obtain high-precision pose estimation results, but acquiring real poses requires additional expensive equipment, such as lidar or GPS systems. Self-supervised VO methods do not require real poses, so they are more flexible than supervised methods. In recent years, many self-supervised VO methods have been proposed; they mainly use the geometric relationship between depth and camera pose to train the networks. However, their performance still lags somewhat behind that of supervised methods.

Most existing learning-based VO methods do not consider the association between consecutive frames, so the effective information in adjacent frames is not fully used. In addition, moving objects make it difficult to describe all motions in the scene with a single Euclidean transformation: their motions are usually not consistent with the camera trajectory, so it is unreasonable to treat all pixels equally during network training. Furthermore, most existing methods fail when frames are missing. To solve these problems, a robust self-supervised monocular VO method based on a prediction-update pose estimation network is proposed.

To better use the related information of adjacent frames and make the method adapt to frame-missing cases, we introduce the Kalman filter's prediction-update idea into the design of the pose estimation network. The contributions of this paper can be summarized as follows.

(1) An end-to-end self-supervised monocular VO framework is proposed. We design a novel pose estimation network based on a prediction-update mechanism, which uses the information extracted at previous moments to guide the estimation of the current pose and, with a novel training strategy, adapts to frame-missing cases.

(2) We propose a novel loss function for network training to improve robustness and accuracy. The long-time pose consistency constraint-based motion weighted photometric loss term is designed to reduce the impact of moving objects and enhance the temporal consistency of the estimation results. Occlusion detection, depth smoothness and auto-masking are used to construct the depth consistency constraint loss term to reduce the influence of interferences such as occlusion.

(3) Extensive experimental results on the KITTI dataset show that our proposed method achieves state-of-the-art accuracy, and the ablation study validates that the reconstructed image-based motion weighting, the depth consistency constraint and the depth image-based occlusion detection all bring improvements to the proposed VO method.
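Among the loss terms described in contribution (2), the depth smoothness term is a standard component of self-supervised depth/VO pipelines. The sketch below assumes the common edge-aware formulation popularized by Godard et al., in which depth gradients are penalized but down-weighted where the image itself has strong gradients; the paper's exact weighting may differ.

```python
import numpy as np

def smoothness_loss(depth, image):
    """Mean depth-gradient magnitude, down-weighted at image edges."""
    d_dx = np.abs(np.diff(depth, axis=1))
    d_dy = np.abs(np.diff(depth, axis=0))
    # Average color gradients over channels to get per-pixel edge strength.
    i_dx = np.mean(np.abs(np.diff(image, axis=1)), axis=-1)
    i_dy = np.mean(np.abs(np.diff(image, axis=0)), axis=-1)
    # exp(-|dI|) lets depth vary freely across image edges.
    return float(np.mean(d_dx * np.exp(-i_dx)) + np.mean(d_dy * np.exp(-i_dy)))

# A perfectly flat depth map incurs no smoothness penalty.
flat = smoothness_loss(np.ones((4, 4)), np.zeros((4, 4, 3)))
```

The exponential weighting is the key design choice: depth discontinuities are only penalized where the image is textureless, so object boundaries (which coincide with strong image gradients) remain sharp.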

Related works

Traditional geometry-based VO methods can be divided into keypoint-based methods (Campos et al., 2021, Mur-Artal et al., 2015, Mur-Artal and Tardos, 2017) and direct methods (Engel et al., 2018, Engel et al., 2014). Keypoint-based methods first extract features from the image, match them, and then estimate the pose. MonoSLAM (Monocular SLAM) (Davison et al., 2007) is the first real-time monocular visual SLAM system. It uses the extended Kalman filter method to track sparse keypoints on

Prediction-update pose estimation network

The Kalman filter is an algorithm that uses a linear system state equation to optimally estimate the system state from the system's inputs and output observations. It has been successfully used in traditional geometry-based VO and SLAM systems. Inspired by the Kalman filter's idea of "prediction-update", we design a novel prediction-update pose estimation network (PU-PoseNet) to obtain the relative pose between the current frame I_t and the previous frame I_{t-1}. As shown in Fig. 1, the
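To make the prediction-update cycle concrete, a minimal one-dimensional Kalman filter can be sketched as follows. This is a toy illustration only: in PU-PoseNet the linear equations below are replaced by learned network modules, and the state is a camera pose rather than a scalar.

```python
def kalman_step(x, p, z, q=1e-3, r=1e-1):
    """One prediction-update cycle for a constant-state model.

    x, p : previous state estimate and its variance
    z    : new measurement (playing the role of the current frame)
    q, r : process and measurement noise variances
    """
    # Prediction: propagate the previous estimate forward in time.
    x_pred = x       # constant-state motion model
    p_pred = p + q   # uncertainty grows during prediction

    # Update: correct the prediction with the new observation.
    k = p_pred / (p_pred + r)          # Kalman gain
    x_new = x_pred + k * (z - x_pred)  # blend prediction and measurement
    p_new = (1.0 - k) * p_pred         # uncertainty shrinks after update
    return x_new, p_new

# Filtering a noisy constant signal converges toward the true value (1.0).
x, p = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 1.0, 0.95]:
    x, p = kalman_step(x, p, z)
```

The prediction step is what carries information forward between frames; this is precisely the property the paper exploits, since a prediction remains available even when a measurement (frame) is missing.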

Self-supervised monocular visual odometry

In this section, we present a novel self-supervised monocular VO method based on our proposed PU-PoseNet. At first, the overall framework of the proposed method is presented. Then, the detailed construction process of each loss function term is described in the following subsections.

Inspired by the work of SfMLearner (Zhou et al., 2017), we design a novel self-supervised monocular VO method using our proposed PU-PoseNet. Its main idea is to first learn two networks on the estimation of monocular
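The view-synthesis principle behind such self-supervised training can be illustrated for a single pixel: a target pixel is back-projected using its predicted depth, transformed by the predicted relative pose, and re-projected into the source frame; the photometric difference between the two locations then supervises both networks. The sketch below shows this geometry only; the intrinsics matrix `K` and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def warp_pixel(u, v, depth, K, T):
    """Project target pixel (u, v) with predicted depth into the source view.

    K : 3x3 camera intrinsics
    T : 4x4 relative pose (target -> source)
    """
    # Back-project to a 3-D point in the target camera frame.
    p_cam = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Move the point into the source camera frame.
    p_src = (T @ np.append(p_cam, 1.0))[:3]
    # Re-project with the intrinsics and dehomogenize.
    uvw = K @ p_src
    return uvw[:2] / uvw[2]

# Example intrinsics; with an identity pose every pixel maps onto itself.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
same = warp_pixel(100.0, 80.0, 10.0, K, np.eye(4))
```

In training, this warp is applied densely so the source frame can be resampled into the target view, and the photometric loss compares the resampled image with the real target frame; errors in either depth or pose break the alignment and increase the loss.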

Experiments and evaluations

In this paper, we use the KITTI odometry dataset (Geiger et al., 2012) to evaluate our method. The KITTI odometry dataset includes binocular images, lidar points and ground truth for urban and highway environments, supporting tasks of stereo, optical flow, visual odometry, 3D object detection and 3D tracking. It contains 11 video sequences (00–10) with ground-truth trajectories. In our experiments, we use the same dataset splitting strategy as Zhou et al. (2017). Sequences 00–08 are used for training

Conclusions

In this paper, we propose an end-to-end self-supervised monocular VO method that applies the Kalman filter's prediction-update idea to pose estimation. Firstly, we design a novel prediction-update pose estimation network based on this idea, which makes effective use of the information between adjacent frames. Secondly, the long-time pose consistency constraint-based motion weighted photometric loss, the depth smoothness loss, the depth consistency

CRediT authorship contribution statement

Haixin Xiu: Methodology, Software, Writing – original draft, Writing – review & editing. Yiyou Liang: Investigation, Validation. Hui Zeng: Project administration, Supervision, Writing – review & editing. Qing Li: Conceptualization, Formal analysis. Hongmin Liu: Writing – review & editing. Bin Fan: Writing – review. Chen Li: Conceptualization, Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

This work was supported by the National Key R&D Program of China (2020YFB1313002), the National Natural Science Foundation of China (Grant Nos. 61973029, 62076026, 62033010), the Scientific and Technological Innovation Foundation of Foshan, China (BK21BF004), and a Research Project of the Beijing Young Topnotch Talents Cultivation Program, China (Grant No. CIT&TCD201904009).

References (47)

  • Bian, J.-W., et al. Unsupervised scale-consistent depth and ego-motion learning from monocular video.
  • Campos, C., et al. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robot. (2021).
  • Charles, R.Q., et al. PointNet: Deep learning on point sets for 3D classification and segmentation.
  • Davison, A.J., et al. MonoSLAM: Real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. (2007).
  • DeTone, D., et al. SuperPoint: Self-supervised interest point detection and description.
  • Engel, J., et al. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. (2018).
  • Engel, J., et al. LSD-SLAM: Large-scale direct monocular SLAM.
  • Engel, J., et al. Semi-dense visual odometry for a monocular camera.
  • Feng, T., et al. SGANVO: Unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks. IEEE Robot. Autom. Lett. (2019).
  • Geiger, A., et al. Are we ready for autonomous driving? The KITTI vision benchmark suite.
  • Godard, C., et al. Unsupervised monocular depth estimation with left-right consistency.
  • Godard, C., et al. Digging into self-supervised monocular depth estimation.
  • Gordon, A., et al. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras.
  • He, K., et al. Mask R-CNN.
  • Huang, Y., et al. Learning optical flow with R-CNN for visual odometry.
  • Jau, Y.-Y., et al. Deep keypoint-based camera pose estimation with geometric constraints.
  • Kingma, D.P., Welling, M., 2014. Auto-encoding variational bayes. In: International Conference on Learning...
  • Klein, G., et al. Parallel tracking and mapping for small AR workspaces.
  • Li, Y., et al. Pose graph optimization for unsupervised monocular visual odometry.
  • Li, R., et al. UnDeepVO: Monocular visual odometry through unsupervised deep learning.
  • Liang, Z., et al. Deep unsupervised learning based visual odometry with multi-scale matching and latent feature constraint.
  • Liu, Y., et al. GIFT: Learning transformation-invariant dense visual descriptors via group CNNs.
  • Mahjourian, R., et al. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints.
