
Self-supervised Learning of Depth and Camera Motion from 360° Videos

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (volume 11365)

Abstract

As 360° cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360° perception becomes increasingly important. We propose a novel self-supervised learning approach for predicting omnidirectional depth and camera motion from a 360° video. In particular, starting from SfMLearner, which is designed for cameras with a normal field-of-view, we introduce three key features to process 360° images efficiently. First, we convert each image from equirectangular projection to cubic projection to avoid image distortion. In each network layer, we use Cube Padding (CP), which pads intermediate features from adjacent faces, to avoid artificial image boundaries. Second, we propose a novel “spherical” photometric consistency constraint over the whole viewing sphere. In this way, no pixel is projected outside the image boundary, which typically happens in images with a normal field-of-view. Finally, rather than naively estimating six independent camera motions (i.e., applying SfMLearner to each cube face), we propose a novel camera pose consistency loss to ensure the estimated camera motions reach a consensus. To train and evaluate our approach, we collect a new dataset, PanoSUNCG, containing a large number of 360° videos with ground-truth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation on PanoSUNCG with faster inference than processing equirectangular images. On real-world indoor videos, our approach also achieves qualitatively reasonable depth prediction using a model pre-trained on PanoSUNCG.
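To make the first step concrete, the sketch below shows one plausible way to resample an equirectangular frame into six cube faces with PyTorch (which the authors use, per [21]). This is a minimal illustration, not the paper's implementation: the function name `equirect_to_cubemap`, the face ordering, and the axis conventions are our own assumptions, since the abstract does not specify them.

```python
import math
import torch
import torch.nn.functional as F

def equirect_to_cubemap(equi, face_size):
    """Resample six cube faces from an equirectangular image.

    equi: (B, C, H, W) equirectangular frame (longitude along width,
          latitude along height). Returns (B, C, 6, face_size, face_size).
    The face layout and axis conventions here are illustrative assumptions.
    """
    B = equi.shape[0]
    # Pixel-center coordinates on each face plane, in [-1, 1].
    r = torch.linspace(-1 + 1 / face_size, 1 - 1 / face_size, face_size)
    b, a = torch.meshgrid(r, r, indexing="ij")   # b: rows, a: columns
    one = torch.ones_like(a)
    # One unit-cube ray direction per output pixel, one set per face;
    # face order assumed here: +x, -x, +y, -y, +z, -z.
    dirs = torch.stack([
        torch.stack([one,  b, -a], -1),   # +x (right)
        torch.stack([-one, b,  a], -1),   # -x (left)
        torch.stack([a,  one, -b], -1),   # +y
        torch.stack([a, -one,  b], -1),   # -y
        torch.stack([a,  b,  one], -1),   # +z (front)
        torch.stack([-a, b, -one], -1),   # -z (back)
    ])                                     # (6, fs, fs, 3)
    x, y, z = dirs.unbind(-1)
    lon = torch.atan2(x, z)                          # [-pi, pi]
    lat = torch.atan2(y, (x ** 2 + z ** 2).sqrt())   # [-pi/2, pi/2]
    # Normalized (u, v) sampling grid for grid_sample.
    grid = torch.stack([lon / math.pi, 2 * lat / math.pi], -1)
    grid = grid.to(equi.device, equi.dtype)
    faces = [
        F.grid_sample(equi, grid[f].unsqueeze(0).expand(B, -1, -1, -1),
                      align_corners=False)
        for f in range(6)
    ]
    # Caveat: grid_sample does not wrap horizontally, so the back face
    # straddles the longitude seam; a real pipeline needs circular
    # handling there, and adjacent-face context at every face edge is
    # exactly what the paper's Cube Padding supplies inside the network.
    return torch.stack(faces, dim=2)
```

After this conversion, each face is an ordinary perspective-like image, so a normal-FoV depth/pose network can be run per face, with CP stitching features across face boundaries and the pose consistency loss tying the six per-face motion estimates together.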

F.-E. Wang, H.-N. Hu, and H.-T. Cheng contributed equally to this paper.


Notes

  1. https://aliensunmin.github.io/project/360-depth/.

References

  1. Byravan, A., Fox, D.: SE3-nets: learning rigid body motion using deep neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 173–180. IEEE (2017)

  2. Caruso, D., Engel, J., Cremers, D.: Large-scale direct SLAM for omnidirectional cameras. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 141–148. IEEE (2015)

  3. Chang, P., Hebert, M.: Omni-directional structure from motion. In: Proceedings of the 2000 IEEE Workshop on Omnidirectional Vision, pp. 127–133 (2000)

  4. Cheng, H.T., Chao, C.H., Dong, J.D., Wen, H.K., Liu, T.L., Sun, M.: Cube padding for weakly-supervised saliency prediction in 360° videos. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  5. Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54

  6. Flynn, J., Neulander, I., Philbin, J., Snavely, N.: DeepStereo: learning to predict new views from the world’s imagery. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515–5524 (2016)

  7. Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 740–756. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_45

  8. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)

  9. Guan, H., Smith, W.A.P.: Structure-from-motion in spherical video using the von Mises-Fisher distribution. IEEE Trans. Image Process. 26(2), 711–723 (2017)

  10. Häne, C., et al.: 3D visual perception for self-driving cars using a multi-camera system: calibration, mapping, localization, and obstacle detection. Image Vis. Comput. (IMAVIS) 68, 14–27 (2017). Special Issue “Automotive Vision”

  11. Hu, H.N., Lin, Y.C., Liu, M.Y., Cheng, H.T., Chang, Y.J., Sun, M.: Deep 360 pilot: learning a deep agent for piloting through 360° sports videos. In: CVPR (2017)

  12. Im, S., Ha, H., Rameau, F., Jeon, H.-G., Choe, G., Kweon, I.S.: All-around depth from small motion with a spherical panoramic camera. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 156–172. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_10

  13. Kangni, F., Laganiere, R.: Orientation and pose recovery from spherical panoramas. In: 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8, October 2007

  14. Kos, A., Tomazic, S., Umek, A.: Evaluation of smartphone inertial sensor performance for cross-platform mobile applications. Sensors 16, 477 (2016)

  15. Lai, W.S., Huang, Y., Joshi, N., Buehler, C., Yang, M.H., Kang, S.B.: Semantic-driven generation of hyperlapse from 360° video. TVCG 24(9), 2610–2621 (2017)

  16. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248. IEEE (2016)

  17. Lin, Y.C., Chang, Y.J., Hu, H.N., Cheng, H.T., Huang, C.W., Sun, M.: Tell me where to look: investigating ways for assisting focus in 360° video. In: CHI (2017)

  18. Mahjourian, R., Wicke, M., Angelova, A.: Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  19. Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048 (2016)

  20. Pagani, A., Stricker, D.: Structure from motion using full spherical panoramic cameras. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 375–382, November 2011

  21. Paszke, A., Chintala, S.: PyTorch. https://github.com/apaszke/pytorch-dist

  22. Pathak, S., Moro, A., Fujii, H., Yamashita, A., Asama, H.: 3D reconstruction of structures using spherical cameras with small motion. In: 2016 16th International Conference on Control, Automation and Systems (ICCAS), pp. 117–122, October 2016

  23. Schönbein, M., Geiger, A.: Omnidirectional 3D reconstruction in augmented Manhattan worlds. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 716–723, September 2014

  24. Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)

  25. Su, Y.C., Grauman, K.: Learning spherical convolution for fast features from 360° imagery. In: NIPS (2017)

  26. Su, Y.C., Grauman, K.: Making 360° video watchable in 2D: learning videography for click free viewing. In: CVPR (2017)

  27. Su, Y.C., Jayaraman, D., Grauman, K.: Pano2Vid: automatic cinematography for watching 360° videos. In: ACCV (2016)

  28. Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  29. Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: SfM-Net: learning of structure and motion from video. CoRR abs/1704.07804 (2017)

  30. Wang, C., Miguel Buenaposada, J., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  31. Wang, F.E., et al.: Technical report of self-supervised 360 depth (2018). https://aliensunmin.github.io/project/360-depth/

  32. Wang, T.H., Huang, H.J., Lin, J.T., Hu, C.W., Zeng, K.H., Sun, M.: Omnidirectional CNN for visual place recognition and navigation. CoRR abs/1803.04228v1 (2018)

  33. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)


Acknowledgements

We thank MOST-107-2634-F-007-007, MOST-107-2218-E-007-047, and MEDIATEK for their support.

Author information


Corresponding authors

Correspondence to Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu or Min Sun.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, F.-E. et al. (2019). Self-supervised Learning of Depth and Camera Motion from 360° Videos. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. Lecture Notes in Computer Science, vol 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_4


  • DOI: https://doi.org/10.1007/978-3-030-20873-8_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20872-1

  • Online ISBN: 978-3-030-20873-8

  • eBook Packages: Computer Science; Computer Science (R0)
