
Self-supervised Learning of Depth and Camera Motion from 360° Videos

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (volume 11365)

Abstract

As 360° cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360° perception becomes increasingly important. We propose a novel self-supervised learning approach for predicting omnidirectional depth and camera motion from a 360° video. In particular, starting from SfMLearner, which is designed for cameras with a normal field-of-view, we introduce three key features to process 360° images efficiently. First, we convert each image from equirectangular projection to cubic projection to avoid image distortion. In each network layer, we use Cube Padding (CP), which pads intermediate features from adjacent faces, to avoid artificial image boundaries. Second, we propose a novel “spherical” photometric consistency constraint over the whole viewing sphere. In this way, no pixel is projected outside the image boundary, which typically happens in images with a normal field-of-view. Finally, rather than naively estimating six independent camera motions (i.e., applying SfMLearner to each cube face), we propose a novel camera pose consistency loss to ensure the estimated camera motions reach a consensus. To train and evaluate our approach, we collect a new dataset, PanoSUNCG, containing a large number of 360° videos with ground-truth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation on PanoSUNCG with faster inference than processing equirectangular images. On real-world indoor videos, our approach also achieves qualitatively reasonable depth prediction using a model pre-trained on PanoSUNCG.
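To make the first step concrete, the sketch below shows one plausible way to resample an equirectangular frame into six cube faces with PyTorch (which the authors use, per [21]). This is a minimal illustration, not the paper's implementation: the function name `equirect_to_cubemap`, the face ordering, and the axis conventions are our own assumptions, since the abstract does not specify them.

```python
import math
import torch
import torch.nn.functional as F

def equirect_to_cubemap(equi, face_size):
    """Resample six cube faces from an equirectangular image.

    equi: (B, C, H, W) equirectangular frame (longitude along width,
          latitude along height). Returns (B, C, 6, face_size, face_size).
    The face layout and axis conventions here are illustrative assumptions.
    """
    B = equi.shape[0]
    # Pixel-center coordinates on each face plane, in [-1, 1].
    r = torch.linspace(-1 + 1 / face_size, 1 - 1 / face_size, face_size)
    b, a = torch.meshgrid(r, r, indexing="ij")   # b: rows, a: columns
    one = torch.ones_like(a)
    # One unit-cube ray direction per output pixel, one set per face;
    # face order assumed here: +x, -x, +y, -y, +z, -z.
    dirs = torch.stack([
        torch.stack([one,  b, -a], -1),   # +x (right)
        torch.stack([-one, b,  a], -1),   # -x (left)
        torch.stack([a,  one, -b], -1),   # +y
        torch.stack([a, -one,  b], -1),   # -y
        torch.stack([a,  b,  one], -1),   # +z (front)
        torch.stack([-a, b, -one], -1),   # -z (back)
    ])                                     # (6, fs, fs, 3)
    x, y, z = dirs.unbind(-1)
    lon = torch.atan2(x, z)                          # [-pi, pi]
    lat = torch.atan2(y, (x ** 2 + z ** 2).sqrt())   # [-pi/2, pi/2]
    # Normalized (u, v) sampling grid for grid_sample.
    grid = torch.stack([lon / math.pi, 2 * lat / math.pi], -1)
    grid = grid.to(equi.device, equi.dtype)
    faces = [
        F.grid_sample(equi, grid[f].unsqueeze(0).expand(B, -1, -1, -1),
                      align_corners=False)
        for f in range(6)
    ]
    # Caveat: grid_sample does not wrap horizontally, so the back face
    # straddles the longitude seam; a real pipeline needs circular
    # handling there, and adjacent-face context at every face edge is
    # exactly what the paper's Cube Padding supplies inside the network.
    return torch.stack(faces, dim=2)
```

After this conversion, each face is an ordinary perspective-like image, so a normal-FoV depth/pose network can be run per face, with CP stitching features across face boundaries and the pose consistency loss tying the six per-face motion estimates together.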

F.-E. Wang, H.-N. Hu, and H.-T. Cheng contributed equally to this paper.


Notes

  1. https://aliensunmin.github.io/project/360-depth/.

References

  1. Byravan, A., Fox, D.: SE3-nets: learning rigid body motion using deep neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 173–180. IEEE (2017)

  2. Caruso, D., Engel, J., Cremers, D.: Large-scale direct SLAM for omnidirectional cameras. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 141–148. IEEE (2015)

  3. Chang, P., Hebert, M.: Omni-directional structure from motion. In: Proceedings of the 2000 IEEE Workshop on Omnidirectional Vision, pp. 127–133 (2000)

  4. Cheng, H.T., Chao, C.H., Dong, J.D., Wen, H.K., Liu, T.L., Sun, M.: Cube padding for weakly-supervised saliency prediction in 360° videos. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  5. Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54

  6. Flynn, J., Neulander, I., Philbin, J., Snavely, N.: DeepStereo: learning to predict new views from the world’s imagery. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515–5524 (2016)

  7. Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 740–756. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_45

  8. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR (2017)

  9. Guan, H., Smith, W.A.P.: Structure-from-motion in spherical video using the von Mises-Fisher distribution. IEEE Trans. Image Process. 26(2), 711–723 (2017)

  10. Häne, C., et al.: 3D visual perception for self-driving cars using a multi-camera system: calibration, mapping, localization, and obstacle detection. Image Vis. Comput. (IMAVIS) 68, 14–27 (2017). Special Issue “Automotive Vision”

  11. Hu, H.N., Lin, Y.C., Liu, M.Y., Cheng, H.T., Chang, Y.J., Sun, M.: Deep 360 pilot: learning a deep agent for piloting through 360° sports videos. In: CVPR (2017)

  12. Im, S., Ha, H., Rameau, F., Jeon, H.-G., Choe, G., Kweon, I.S.: All-around depth from small motion with a spherical panoramic camera. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 156–172. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_10

  13. Kangni, F., Laganiere, R.: Orientation and pose recovery from spherical panoramas. In: 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8, October 2007

  14. Kos, A., Tomazic, S., Umek, A.: Evaluation of smartphone inertial sensor performance for cross-platform mobile applications. Sensors 16, 477 (2016)

  15. Lai, W.S., Huang, Y., Joshi, N., Buehler, C., Yang, M.H., Kang, S.B.: Semantic-driven generation of hyperlapse from 360° video. TVCG 24(9), 2610–2621 (2017)

  16. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248. IEEE (2016)

  17. Lin, Y.C., Chang, Y.J., Hu, H.N., Cheng, H.T., Huang, C.W., Sun, M.: Tell me where to look: investigating ways for assisting focus in 360° video. In: CHI (2017)

  18. Mahjourian, R., Wicke, M., Angelova, A.: Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  19. Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048 (2016)

  20. Pagani, A., Stricker, D.: Structure from motion using full spherical panoramic cameras. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 375–382, November 2011

  21. Paszke, A., Chintala, S.: PyTorch. https://github.com/apaszke/pytorch-dist

  22. Pathak, S., Moro, A., Fujii, H., Yamashita, A., Asama, H.: 3D reconstruction of structures using spherical cameras with small motion. In: 2016 16th International Conference on Control, Automation and Systems (ICCAS), pp. 117–122, October 2016

  23. Schönbein, M., Geiger, A.: Omnidirectional 3D reconstruction in augmented Manhattan worlds. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 716–723, September 2014

  24. Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)

  25. Su, Y.C., Grauman, K.: Learning spherical convolution for fast features from 360° imagery. In: NIPS (2017)

  26. Su, Y.C., Grauman, K.: Making 360° video watchable in 2D: learning videography for click free viewing. In: CVPR (2017)

  27. Su, Y.C., Jayaraman, D., Grauman, K.: Pano2Vid: automatic cinematography for watching 360° videos. In: ACCV (2016)

  28. Tateno, K., Tombari, F., Laina, I., Navab, N.: CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  29. Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: SfM-Net: learning of structure and motion from video. CoRR abs/1704.07804 (2017)

  30. Wang, C., Miguel Buenaposada, J., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  31. Wang, F.E., et al.: Technical report of self-supervised 360 depth (2018). https://aliensunmin.github.io/project/360-depth/

  32. Wang, T.H., Huang, H.J., Lin, J.T., Hu, C.W., Zeng, K.H., Sun, M.: Omnidirectional CNN for visual place recognition and navigation. CoRR abs/1803.04228v1 (2018)

  33. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)


Acknowledgements

We thank MOST-107-2634-F-007-007, MOST-107-2218-E-007-047, and MEDIATEK for their support.

Author information


Corresponding authors

Correspondence to Fu-En Wang, Hou-Ning Hu, Hsien-Tzu Cheng, Juan-Ting Lin, Shang-Ta Yang, Meng-Li Shih, Hung-Kuo Chu or Min Sun.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, F.-E. et al. (2019). Self-supervised Learning of Depth and Camera Motion from 360° Videos. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. Lecture Notes in Computer Science, vol 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_4


  • DOI: https://doi.org/10.1007/978-3-030-20873-8_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20872-1

  • Online ISBN: 978-3-030-20873-8

  • eBook Packages: Computer Science; Computer Science (R0)
