Rigid-aware self-supervised GAN for camera ego-motion estimation

https://doi.org/10.1016/j.dsp.2022.103471

Abstract

Learning-based camera ego-motion estimation has attracted increasing attention and made impressive progress. However, the accuracy of the unsupervised paradigm remains limited, especially in complex dynamic environments. In this paper, we propose a rigid-aware self-supervised generative adversarial network (GAN) for camera ego-motion estimation, which effectively learns the rigidity of the scene and improves the accuracy of ego-motion estimation by combining pixel- and structure-level perception. Specifically, a rigid-aware generator is first designed for joint unsupervised learning of optical flow, stereo depth and camera pose from two consecutive frames. Then, an iterative pose refinement strategy with rigidity learning is presented to reduce the impact of moving objects in scenes. To overcome the limitations of purely pixel-wise photometric methods, a rigidity-mask embedded discriminator is attached to perceive structural distortion artifacts in synthesized fake images, which encourages the generator to learn additional structure-level information and improves the accuracy of pose estimation. Experiments on the benchmark datasets show that our model achieves state-of-the-art performance in terms of both relative pose error (RPE) and absolute trajectory error (ATE) compared with recent GAN-based methods.

Introduction

Camera ego-motion estimation, i.e., recovering the 6-degrees-of-freedom (DoF) pose of a camera mounted on a mobile platform, is usually an essential intermediate task in localization, autonomous driving, robot visual navigation, and related fields [1]. Although traditional vision-based ego-motion estimation has been well studied in structure-from-motion (SfM) and visual simultaneous localization and mapping (vSLAM) systems [2], [3], it often fails in low-textured regions, under occlusion, and in dynamic scenes, and its performance relies heavily on the accuracy of image feature extraction and matching.

With the rapid development of deep learning techniques in recent years, convolutional neural networks (CNNs) have made impressive progress in this field. Compared with traditional ego-motion estimation, CNN-based methods [4], [5] directly learn scene geometry and ego-motion from image sequences, and are more robust in challenging scenarios such as large textureless regions, occlusion, repetitive patterns, reflections, shadows, illumination variations, or poor imaging quality. However, most of them are trained in a supervised fashion on substantial labeled datasets, which are often difficult or impractical to obtain.

To overcome this ground-truth data limitation, recent work has paid increasing attention to unsupervised approaches [6], [7], [8]. The main idea behind the unsupervised methodology is to minimize the photometric re-projection loss between two adjacent frames, based on the photometric consistency assumption. However, this assumption is often violated in the presence of moving objects or occlusions [9], [10]. Moreover, the photometric loss is a purely pixel-wise loss and therefore struggles to eliminate the distortion artifacts [11] caused by inaccurate depth and pose.
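To make this concrete, a widely used form of the photometric re-projection loss (a generic formulation from the unsupervised depth/ego-motion literature, not necessarily the exact loss adopted in this paper) warps a source frame I_s into the target view using the predicted depth D_t, relative pose T_{t→s} and camera intrinsics K, and mixes an SSIM term with an L1 term via a weight α:

```latex
% Generic photometric re-projection loss (illustrative only).
% p' is the re-projection of target pixel p into the source frame:
\begin{align}
  p' &\sim K \, T_{t \to s} \, D_t(p) \, K^{-1} p \\
  \mathcal{L}_{\mathrm{photo}} &= \sum_{p} \Big[ \alpha \,
      \tfrac{1}{2}\big(1 - \mathrm{SSIM}(I_t(p), I_s(p'))\big)
      + (1 - \alpha)\, \big\| I_t(p) - I_s(p') \big\|_1 \Big]
\end{align}
```

Note that a moving object violates this loss even under perfect depth and pose, which is precisely the failure mode that rigidity reasoning targets.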

In this paper, a rigid-aware self-supervised generative adversarial network (GAN) for camera ego-motion estimation is proposed to address the aforementioned issues of the unsupervised paradigm. Our key idea is to infer the rigidity of each scene pixel for camera ego-motion estimation, and to embed a rigidity mask into an adversarial training framework to enforce structure-level perception of static scene regions. To this end, we first design a generator with an iterative optimization strategy for joint unsupervised learning of stereo depth, relative camera pose and optical flow. We then infer the per-pixel rigidity from the rigid geometric consistency assumption between optical flow and stereo depth (sketched at the end of this section). Lastly, a rigidity-mask embedded discriminator is adopted for adversarial training by distinguishing the distortion artifacts in synthesized fake images that are caused by inaccurate depth and pose. Compared with existing unsupervised frameworks for camera ego-motion estimation, our method makes two contributions.

(1) A rigid-aware self-supervised generative adversarial learning framework is proposed for robust camera ego-motion estimation, which contains a rigid-aware generator and a rigidity-mask embedded discriminator. By taking full advantage of adversarial training, our model combines pixel- and structure-level rigidity perception for robust unsupervised learning of camera ego-motion.

(2) An iterative optimization strategy is adopted for iterative refinement of the relative camera pose. It is based on a decoupled pose estimation structure and iteratively updates the camera pose through a recurrent unit. At each iteration, it learns a residual pose on top of the previously estimated pose and produces a progressively finer pose estimate.
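A minimal sketch of such a refinement loop is given below. It is illustrative only: it is written in PyTorch for brevity although the paper reports a TensorFlow implementation, and the GRU cell and the additive 6-DoF residual update are our assumptions, not confirmed details of the method.

```python
import torch
import torch.nn as nn

class IterativePoseRefiner(nn.Module):
    """Sketch of recurrent residual pose refinement (illustrative).

    At each step a GRU cell consumes motion features together with the
    current pose estimate, predicts a 6-DoF residual, and composes it
    with the running estimate.
    """
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim + 6, hidden_dim)
        self.head = nn.Linear(hidden_dim, 6)  # residual (rotation, translation)

    def forward(self, feat, num_iters=3):
        B = feat.shape[0]
        pose = torch.zeros(B, 6, device=feat.device)  # start from identity motion
        h = torch.zeros(B, self.cell.hidden_size, device=feat.device)
        for _ in range(num_iters):
            h = self.cell(torch.cat([feat, pose], dim=1), h)
            # Additive update is a small-residual approximation of pose composition.
            pose = pose + self.head(h)
        return pose
```

The appeal of this design is that each iteration only has to explain the remaining pose error, which is typically easier to regress than the full transform.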

We evaluate our model for camera ego-motion estimation on the popular KITTI Visual Odometry benchmark [12]. The evaluation results show that our model achieves state-of-the-art performance.
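To illustrate the per-pixel rigidity inference described above, the following sketch compares the estimated optical flow against the flow that depth and ego-motion alone would induce; pixels where the two disagree are treated as non-rigid. Function names and the threshold are hypothetical, not the paper's code.

```python
import numpy as np

def rigid_flow_from_depth_pose(depth, pose, K):
    """Flow induced by camera ego-motion alone, assuming a static scene.

    depth: (H, W) depth map of the target frame.
    pose:  (4, 4) relative camera pose from target to source.
    K:     (3, 3) camera intrinsics.
    Returns an (H, W, 2) rigid flow field (illustrative implementation).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Homogeneous pixel grid, shape (3, H*W).
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    # Back-project to 3D, apply the relative pose, re-project.
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam = np.vstack([cam, np.ones((1, cam.shape[1]))])
    proj = K @ (pose @ cam)[:3]
    proj = proj[:2] / np.clip(proj[2:], 1e-6, None)
    return proj.reshape(2, H, W).transpose(1, 2, 0) - np.stack([xs, ys], axis=-1)

def rigidity_mask(optical_flow, depth, pose, K, tau=3.0):
    """A pixel is rigid when the full optical flow agrees with the flow
    explained by depth + ego-motion (threshold tau is an assumption)."""
    rigid = rigid_flow_from_depth_pose(depth, pose, K)
    residual = np.linalg.norm(optical_flow - rigid, axis=-1)
    return (residual < tau).astype(np.float32)
```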


Related work

Before the emergence of deep learning, traditional camera ego-motion estimation methods usually follow the feature-based pipeline and the triangulation algorithm [2], [3]. However, the performance of traditional methods may degrade dramatically in challenging situations such as weakly textured or dynamic scenes. To alleviate this problem, a variety of learning-based VO methods have been proposed, achieving impressive results compared with traditional methods. For brevity, we review only the works most relevant to ours.

Network architecture

Building on an in-depth study of our previous work [31], we propose a rigid-aware self-supervised GAN for camera ego-motion estimation (RA-GANVO). Compared with our previous work [31] and most existing methods, RA-GANVO is substantially novel in three aspects: (1) a flow-pose structure for accurate pose estimation that fuses motion features from optical flow; (2) an iterative pose refinement strategy with rigidity learning to reduce the impact of moving objects in scenes; and (3) a rigidity-mask embedded discriminator that perceives structural distortion artifacts in synthesized images.
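As one reading of aspect (3), the sketch below shows how a rigidity mask can be embedded into a discriminator: non-rigid regions are suppressed and the mask is appended as an extra channel, so the critic judges structural plausibility only where the static-scene assumption holds. The layer configuration and masking scheme here are assumptions; the actual network may differ.

```python
import torch
import torch.nn as nn

class RigidityMaskDiscriminator(nn.Module):
    """Illustrative patch discriminator with an embedded rigidity mask."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + 1, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # patch-level logits
        )

    def forward(self, image, rigidity_mask):
        # image: (B, 3, H, W); rigidity_mask: (B, 1, H, W), 1 = rigid pixel.
        # Mask out non-rigid regions so the critic judges only static
        # structure, then append the mask so it knows which pixels were kept.
        x = torch.cat([image * rigidity_mask, rigidity_mask], dim=1)
        return self.net(x)
```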

Implementation details

Our experiments are conducted on a desktop PC with an Intel Core i5-4570 3.2 GHz CPU, 32 GB DDR4 memory, and an NVIDIA GeForce GTX 1080Ti GPU with 11 GB GDDR5X memory.

Due to limited GPU resources, our network takes two consecutive stereo pairs as input, and each image is resized to 128×416. Our network is implemented in Python using TensorFlow and contains about 28.95 million trainable parameters. The initial learning rate is 2×10⁻⁴ and the batch size is 4. It is trained with the Adam optimizer.
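For concreteness, these reported hyper-parameters map onto a TensorFlow training setup along the following lines (a sketch only; `build_ra_ganvo` is a hypothetical stand-in for the generator/discriminator construction, which the paper does not spell out here):

```python
import tensorflow as tf

# Settings reported in the paper.
IMG_H, IMG_W = 128, 416      # input image size after resizing
BATCH_SIZE = 4               # stereo pairs per batch
INIT_LR = 2e-4               # initial learning rate

optimizer = tf.keras.optimizers.Adam(learning_rate=INIT_LR)

# Hypothetical constructor standing in for the ~28.95M-parameter model:
# model = build_ra_ganvo(input_shape=(IMG_H, IMG_W, 3))
```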

Conclusion

We have proposed RA-GANVO, a novel rigid-aware self-supervised GAN for camera ego-motion estimation. It employs a rigidity-mask embedded discriminator to exploit the structural perception ability of adversarial learning. In addition, an iterative optimization strategy is proposed for pose refinement. We have validated our model on the KITTI Visual Odometry Datasets. Both quantitative evaluation and visual comparisons demonstrate the superiority of our model over recent GAN-based methods.

CRediT authorship contribution statement

Lili Lin: Conceptualization, Validation, Writing – original draft. Wan Luo: Data curation, Software. Zhengmao Yan: Data curation, Software. Wenhui Zhou: Formal analysis, Investigation, Methodology, Writing – original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported in part by the Zhejiang Provincial Natural Science Foundation of China (LY21F010007) and the Joint Funds of the Zhejiang Provincial Natural Science Foundation of China (LTY22F020001). The authors are grateful to the anonymous reviewers for their constructive comments.


References (50)

  • L. Lin et al., Unsupervised monocular visual odometry with decoupled camera pose estimation, Digit. Signal Process. (2021)
  • D. Scaramuzza et al., Visual odometry, IEEE Robot. Autom. Mag. (2011)
  • R. Mur-Artal et al., ORB-SLAM: a versatile and accurate monocular SLAM system, IEEE Trans. Robot. (2015)
  • R. Mur-Artal et al., ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras, IEEE Trans. Robot. (2017)
  • A. Kendall et al., PoseNet: a convolutional network for real-time 6-DoF camera relocalization
  • G. Costante et al., Exploring representation learning with CNNs for frame-to-frame ego-motion estimation, IEEE Robot. Autom. Lett. (2016)
  • T. Zhou et al., Unsupervised learning of depth and ego-motion from video
  • R. Mahjourian et al., Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints
  • Z. Yin et al., Unsupervised learning of dense depth, optical flow and camera pose
  • Y. Wang et al., UnOS: unified unsupervised optical-flow and stereo-depth estimation by watching videos
  • Y. Jiao et al., EffiScene: efficient per-pixel rigidity inference for unsupervised joint learning of optical flow, depth, camera pose and motion segmentation
  • S. Li et al., Sequential adversarial learning for self-supervised deep visual odometry
  • A. Geiger et al., Are we ready for autonomous driving? The KITTI vision benchmark suite
  • B. Ummenhofer et al., DeMoN: depth and motion network for learning monocular stereo
  • H. Zhou et al., DeepTAM: deep tracking and mapping
  • S. Wang et al., DeepVO: towards end-to-end visual odometry with deep recurrent convolutional neural networks
  • S. Wang et al., End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks, Int. J. Robot. Res. (2017)
  • J. Jiao et al., MagicVO: an end-to-end hybrid CNN and Bi-LSTM method for monocular visual odometry, IEEE Access (2019)
  • F. Xue et al., Beyond tracking: selecting memory and refining poses for deep visual odometry
  • H. Zhan et al., Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction
  • C. Chi et al., Feature-level collaboration: joint unsupervised learning of optical flow, stereo depth and camera motion
  • H. Jiang et al., Unsupervised monocular depth perception: focusing on moving objects, IEEE Sens. J. (2021)
  • S. Vijayanarasimhan et al., SfM-Net: learning of structure and motion from video
  • Y. Wang et al., Occlusion aware unsupervised learning of optical flow
  • J. Bian et al., Unsupervised scale-consistent depth and ego-motion learning from monocular video, Adv. Neural Inf. Process. Syst. (2019)

Lili Lin is currently an associate professor in the School of Information and Electronic Engineering at Zhejiang Gongshang University, China. She received the Ph.D. degree in information and communication engineering from Zhejiang University, China, in 2005. Since then she has worked in the School of Information and Electronic Engineering at Zhejiang Gongshang University, China. She was a visiting scholar at Indiana University Bloomington from March 2012 to March 2013. Her research interests lie in the general areas of image processing, multimedia signal processing and machine learning. Her current work focuses on depth and motion perception from 2D/3D imaging data.

Wan Luo is currently a master's degree candidate in the School of Information and Electronic Engineering at Zhejiang Gongshang University, China. He received the B.S. degree in Electronic Information Engineering from Harbin Far East Institute of Science and Technology, China, in 2019. His research interests include image processing, computational photography and machine learning.

Zhengmao Yan is currently a master's degree candidate in the School of Computer Science and Technology at Hangzhou Dianzi University, China. He received the B.S. degree in computer science from Hangzhou Dianzi University, China, in 2020. His research interests include ego-motion estimation, optical/scene flow, depth estimation and deep learning.

Wenhui Zhou is currently a professor in the School of Computer Science and Technology at Hangzhou Dianzi University, China. He received the Ph.D. degree in information and communication engineering from Zhejiang University, China, in 2005, and worked as a postdoctoral researcher in the Department of Information Science and Electronic Engineering at Zhejiang University from July 2005 to October 2007. Since then he has worked in the School of Computer Science and Technology at Hangzhou Dianzi University, China. He was a visiting scholar at Indiana University Bloomington from April 2015 to April 2016. His research interests lie in the general areas of computer vision, computational photography and machine learning. His current work focuses on self/un-supervised learning for image/video/3D data representation and understanding.
