Self-Supervised Monocular Depth and Motion Learning in Dynamic Scenes: Semantic Prior to Rescue

Abstract

We introduce an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without geometric supervision. Our technical contributions are four-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we propose two types of residual motion learning frameworks to explicitly disentangle camera and object motions in dynamic driving scenes with different levels of semantic prior knowledge: video instance segmentation as a strong prior, and object detection as a weak prior. Third, we design a unified photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. Lastly, we present an unsupervised method of 3D motion field regularization for semantically plausible object motion representation. Our proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI, Cityscapes, and Waymo Open datasets, our framework is shown to outperform the state-of-the-art depth and motion estimation methods. Our code, dataset, and models are publicly available.

Notes

  1. This is different from the reversed optical flow leveraged in Liu et al. (2019), Wang et al. (2019), and Luo et al. (2019). Since flow-based warping techniques do not consider geometric structure, serious distortions appear where multiple source pixels are warped to the same target location, e.g., at object boundaries, as shown in Fig. 2b. Our forward and inverse warping are not about temporal order, but rather about the coordinate frame in which the geometric transformation is conducted when warping from the reference to the target view (a minimal sketch of the two warping directions is given after these notes). Hereafter, we refer to forward projection as forward warping for consistency with inverse warping.

  2. Previous works (Gordon et al., 2019; Li et al., 2020) have alleviated this issue by applying a motion smoothness term. This is reasonable, but it regularizes only nearby motion vectors. Our regularization method, on the other hand, operates on the distribution of motion vectors. Considering the rigidity of moving objects, e.g., mostly vehicles on traffic roads, we postulate that enforcing consistency over the whole set of motion vectors of each object is more helpful for learning a semantically plausible object motion field (see the sketch following these notes).

  3. In our previous work (Lee et al., 2021), we proposed contrastive sample consensus (CSAC). While CSAC focuses on the motion boundary between the object and the background (modulating two distributions), HSAC takes a more general perspective: it finds and converges to a target value by observing its internal distribution, without any supervision.

  4. This is why we postulate Assumption 1. We use it only for the initial object mask. Since this mask is not accurate, we calculate the regularization loss (\(\textsc {CalcPenalty}()\) in line 16) while excluding query vectors that deviate significantly (\(s_k < 0.01\)); a minimal sketch of this exclusion step is given after these notes.

  5. We surmise that the ATE metric is saturated in this experiment. We therefore provide additional results on relative errors in Table 11 and qualitative results in Fig. 18.
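
The following is a minimal PyTorch sketch, under assumed tensor names and shapes, of the distinction drawn in Note 1 between inverse and forward warping; it is an illustration, not the paper's neural forward projection module. Inverse warping gathers a source pixel at the location each target pixel projects to, whereas forward warping scatters (splats) each source pixel to its projected target location, which is where collisions at object boundaries arise.

```python
import torch
import torch.nn.functional as F

def reproject(depth, K, T):
    """Back-project pixels to 3D using `depth`, transform them by pose T, and
    re-project with intrinsics K. Returns pixel coordinates (B, 2, H, W) in the
    other view. depth: (B, 1, H, W), K: (B, 3, 3), T: (B, 4, 4)."""
    B, _, H, W = depth.shape
    y, x = torch.meshgrid(torch.arange(H, device=depth.device),
                          torch.arange(W, device=depth.device), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)]).float().view(1, 3, -1).expand(B, -1, -1)
    pts = (torch.inverse(K) @ pix) * depth.view(B, 1, -1)            # 3D points in camera frame
    pts = torch.cat([pts, torch.ones(B, 1, H * W, device=depth.device)], dim=1)
    uvw = K @ (T @ pts)[:, :3]                                       # re-projected homogeneous coords
    return (uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)).view(B, 2, H, W)

def inverse_warp(src_img, tgt_depth, K, T_tgt2src):
    """Inverse warping: every TARGET pixel gathers a colour from the source
    image at the location it projects to (differentiable bilinear sampling)."""
    B, _, H, W = src_img.shape
    uv = reproject(tgt_depth, K, T_tgt2src)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,                  # normalize to [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)         # (B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def forward_warp(src_img, src_depth, K, T_src2tgt):
    """Forward warping: every SOURCE pixel is splatted (scattered) to its
    projected target location. Several source pixels can land on the same
    target pixel (typically at object boundaries); this naive sketch resolves
    such collisions arbitrarily."""
    B, C, H, W = src_img.shape
    uv = reproject(src_depth, K, T_src2tgt).round().long()
    out = torch.zeros_like(src_img)
    for b in range(B):
        u, v = uv[b, 0].view(-1), uv[b, 1].view(-1)
        valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        idx = (v * W + u)[valid]
        out[b].view(C, -1)[:, idx] = src_img[b].view(C, -1)[:, valid]
    return out
```

In the naive scatter above, several source pixels can overwrite each other at the same target location; handling such collisions in a geometrically consistent way is the issue the paper's forward projection module addresses.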
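
Next, a hedged sketch of the distribution-level regularization idea in Note 2, contrasting a local smoothness term with a per-object consistency term that pulls every motion vector inside an instance mask toward that instance's mean vector (a rigid-object prior). The tensor names and the quadratic penalty are illustrative assumptions; the paper's CSAC/HSAC formulations are not reproduced here.

```python
import torch

def smoothness_loss(motion):
    """Local smoothness: only neighbouring motion vectors are coupled.
    motion: (B, 3, H, W) dense 3D object-motion field."""
    dx = (motion[..., :, 1:] - motion[..., :, :-1]).abs().mean()
    dy = (motion[..., 1:, :] - motion[..., :-1, :]).abs().mean()
    return dx + dy

def per_object_consistency_loss(motion, masks, eps=1e-6):
    """Distribution-level consistency: every motion vector inside an instance
    mask is pulled toward that instance's mean vector (rigid-object prior),
    so consistency is enforced over the whole object, not just neighbours.
    motion: (B, 3, H, W), masks: (B, K, H, W) binary instance masks."""
    m = motion.unsqueeze(1)                                  # (B, 1, 3, H, W)
    w = masks.unsqueeze(2).float()                           # (B, K, 1, H, W)
    area = w.sum(dim=(3, 4), keepdim=True) + eps             # per-instance pixel count
    mean = (m * w).sum(dim=(3, 4), keepdim=True) / area      # per-instance mean motion
    dev = ((m - mean) ** 2 * w).sum(dim=(3, 4)) / area.squeeze(-1).squeeze(-1)
    return dev.mean()
```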
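
Finally, a small sketch of the outlier-exclusion step described in Note 4: query vectors whose consensus score \(s_k\) falls below 0.01 are dropped before the regularization penalty is computed, so an inaccurate initial object mask does not corrupt the loss. The function name mirrors \(\textsc{CalcPenalty}()\) from the paper's pseudocode, but the penalty form used here (deviation from the inlier mean) is only a hypothetical stand-in.

```python
import torch

def calc_penalty(vectors, scores, threshold=0.01):
    """Per-instance regularization penalty with outlier exclusion.
    vectors: (N, 3) motion vectors of one object instance.
    scores:  (N,)   consensus score s_k of each query vector.
    Vectors with s_k < threshold are excluded; the penalty below is an
    illustrative placeholder, not the paper's exact formulation."""
    keep = scores >= threshold                 # drop strongly deviating query vectors
    if keep.sum() < 2:                         # too few inliers to regularize
        return vectors.new_zeros(())
    inliers = vectors[keep]
    return ((inliers - inliers.mean(dim=0, keepdim=True)) ** 2).mean()
```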

References

  • Bangunharcana, A., Cho, J. W., Lee, S., Kweon, I. S., Kim, K. S., & Kim, S. (2021). Correlate-and-excite: Real-time stereo matching via guided cost volume excitation. In IROS.

  • Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., & Gall, J. (2019). SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In ICCV.

  • Bian, J. W., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M. M., & Reid, I. (2019). Unsupervised scale-consistent depth and ego-motion learning from monocular video. In NeurIPS.

  • Bian, J. W., Zhan, H., Wang, N., Li, Z., Zhang, L., Shen, C., Cheng, M. M., & Reid, I. (2021). Unsupervised scale-consistent depth learning from video. International Journal of Computer Vision (IJCV).

  • Cao, Z., Kar, A., Hane, C., & Malik, J. (2019). Learning independent object motion from unlabelled stereoscopic videos. In CVPR.

  • Casser, V., Pirk, S., Mahjourian, R., & Angelova, A. (2019). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI.

  • Casser, V., Pirk, S., Mahjourian, R., & Angelova, A. (2019). Unsupervised monocular depth and ego-motion learning with structure and semantics. In CVPR workshops.

  • Chang, J.R., & Chen, Y.S. (2018). Pyramid stereo matching network. In CVPR.

  • Chen, P. Y., Liu, A. H., Liu, Y. C., & Wang, Y.C.F. (2019). Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In CVPR.

  • Chen, Y., Schmid, C., & Sminchisescu, C. (2019). Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In ICCV.

  • Cheng, B., Saggu, I. S., Shah, R., Bansal, G., & Bharadia, D. (2020). \(s^{3}\)net: Semantic-aware self-supervised depth estimation with monocular videos and synthetic data. In ECCV.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR.

  • Dai, Q., Patil, V., Hecker, S., Dai, D., Van Gool, L., & Schindler, K. (2020). Self-supervised object motion and depth estimation from video. In CVPR workshops.

  • Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., & Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In ICCV.

  • Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In NIPS.

  • Garg, R., BG, V. K., Carneiro, G., & Reid, I. (2016). Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV.

  • Geiger, A., Lauer, M., Wojek, C., Stiller, C., & Urtasun, R. (2014). 3d traffic scene understanding from movable platforms. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).

  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The kitti vision benchmark suite. In CVPR.

  • Godard, C., Mac Aodha, O., & Brostow, G.J. (2017). Unsupervised monocular depth estimation with left-right consistency. In CVPR.

  • Godard, C., Mac Aodha, O., Firman, M., & Brostow, G.J. (2019). Digging into self-supervised monocular depth estimation. In ICCV.

  • Gordon, A., Li, H., Jonschkowski, R., & Angelova, A. (2019). Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In ICCV.

  • Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., & Gaidon, A. (2020). 3d packing for self-supervised monocular depth estimation. In CVPR.

  • Guizilini, V., Hou, R., Li, J., Ambrus, R., & Gaidon, A. (2020). Semantically-guided representation learning for self-supervised monocular depth. In ICLR.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In ICCV.

  • Hur, J., & Roth, S. (2020). Self-supervised monocular scene flow estimation. In CVPR.

  • Jaderberg, M., Simonyan, K., & Zisserman, A., et al. (2015). Spatial transformer networks. In NIPS.

  • Janai, J., Guney, F., Ranjan, A., Black, M., & Geiger, A. (2018). Unsupervised learning of multi-frame optical flow with occlusions. In ECCV.

  • Kingma, D.P., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.

  • Klingner, M., Termöhlen, J. A., Mikolajczyk, J., & Fingscheidt, T. (2020). Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In ECCV.

  • Lee, S., Im, S., Lin, S., & Kweon, I.S. (2019). Learning residual flow as dynamic motion from stereo videos. In IROS.

  • Lee, S., Im, S., Lin, S., & Kweon, I. S. (2021). Learning monocular depth in dynamic scenes via instance-aware projection consistency. In AAAI.

  • Lee, S., Kim, J., Oh, T. H., Jeong, Y., Yoo, D., Lin, S., & Kweon, I. S. (2019). Visuomotor understanding for representation learning of driving scenes. In BMVC.

  • Lee, S., Rameau, F., Pan, F., Kweon, I. S. (2021). Attentive and contrastive learning for joint depth and motion field estimation. In ICCV.

  • Li, H., Gordon, A., Zhao, H., Casser, V., & Angelova, A. (2020). Unsupervised monocular depth learning in dynamic scenes. In CoRL.

  • Liu, P., Lyu, M., King, I., & Xu, J. (2019). Selflow: Self-supervised learning of optical flow. In CVPR.

  • Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In CVPR.

  • Luo, C., Yang, Z., Wang, P., Wang, Y., Xu, W., Nevatia, R., & Yuille, A. (2019). Every pixel counts++: Joint learning of geometry and motion with 3d holistic understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).

  • Lv, Z., Kim, K., Troccoli, A., Sun, D., Rehg, J. M., & Kautz, J. (2018). Learning rigidity in dynamic scenes with a moving camera for 3d motion field estimation. In ECCV.

  • Mahjourian, R., Wicke, M., & Angelova, A. (2018). Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In CVPR.

  • Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR.

  • Meister, S., Hur, J., & Roth, S. (2018). Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In AAAI.

  • Ošep, A., Mehner, W., Mathias, M., & Leibe, B. (2017). Combined image- and world-space tracking in traffic scenes. In ICRA.

  • Ošep, A., Mehner, W., Voigtlaender, P., & Leibe, B. (2018). Track, then decide: Category-agnostic vision-based multi-object tracking. In ICRA.

  • Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9, 62–66.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., & Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In NeurIPS.

  • Pillai, S., Ambruş, R., & Gaidon, A. (2019). Superdepth: Self-supervised, super-resolved monocular depth estimation. In ICRA.

  • Ranjan, A., Jampani, V., Kim, K., Sun, D., Wulff, J., & Black, M. J. (2019). Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR.

  • Shashua, A., Gdalyahu, Y., & Hayun, G. (2004). Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. In IEEE intelligent vehicles symposium.

  • Shin, K., Kwon, Y. P., & Tomizuka, M. (2019). Roarnet: A robust 3d object detection based on region approximation refinement. In IEEE intelligent vehicles symposium (IV).

  • Sun, D., Yang, X., Liu, M. Y., & Kautz, J. (2018). Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In CVPR.

  • Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.B.G., Geiger, A., & Leibe, B.(2019). Mots: Multi-object tracking and segmentation. In CVPR.

  • Wang, C., Buenaposada, J. M., Zhu, R., & Lucey, S. (2018). Learning depth from monocular videos using direct methods. In CVPR.

  • Wang, Y., Wang, P., Yang, Z., Luo, C., Yang, Y., & Xu, W. (2019). Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In CVPR.

  • Wang, Y., Yang, Y., Yang, Z., Zhao, L., Wang, P., & Xu, W. (2018). Occlusion aware unsupervised learning of optical flow. In CVPR.

  • Wang, Z., Bovik, A.C., Sheikh, H.R., & Simoncelli, E.P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP).

  • Yang, Z., Wang, P., Wang, Y., Xu, W., & Nevatia, R. (2018). Lego: Learning edge with geometry all at once by watching videos. In CVPR.

  • Yin, Z., & Shi, J. (2018). Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR.

  • Zhang, C., Benz, P., Argaw, D. M., Lee, S., Kim, J., Rameau, F., Bazin, J. C., & Kweon, I. S. (2021). Resnet or densenet? introducing dense shortcuts to resnet. In WACV.

  • Zhang, C., Rameau, F., Lee, S., Kim, J., Benz, P., Argaw, D. M., Bazin, J. C., & Kweon, I. S. (2019). Revisiting residual networks with nonlinear shortcuts. In BMVC.

  • Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In CVPR.

Acknowledgements

This work was supported by the KENTECH Research Grant (KRG2022-01-003), the DGIST R&D Program of the Ministry of Science and ICT (20-CoE-IT-01), and the International Research and Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT under Grant NRF-2021K1A3A1A21040016.

Author information

Corresponding author

Correspondence to Seokju Lee.

Additional information

Communicated by Akihiro Sugimoto.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Parts of this paper were published at the 35th AAAI Conference on Artificial Intelligence (AAAI 2021): “Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency” (Lee et al., 2021), and the IEEE International Conference on Computer Vision (ICCV 2021): “Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation” (Lee et al., 2021). We extend our previous works with a unified joint training pipeline of depth and motion field, a novel and generic motion regularization technique, and additional extensive experiments.

https://github.com/SeokjuLee/Insta-DM

Cite this article

Lee, S., Rameau, F., Im, S. et al. Self-Supervised Monocular Depth and Motion Learning in Dynamic Scenes: Semantic Prior to Rescue. Int J Comput Vis 130, 2265–2285 (2022). https://doi.org/10.1007/s11263-022-01641-5
