Research article · MM '23: ACM International Conference on Multimedia, Conference Proceedings
DOI: 10.1145/3581783.3611751

V2Depth: Monocular Depth Estimation via Feature-Level Virtual-View Simulation and Refinement

Published: 27 October 2023

Abstract

Given only a single image, and therefore few spatial cues, many monocular depth estimation methods leverage stereo or multi-view images to learn the spatial information of a scene in a self-supervised manner. However, these methods yield limited performance gains because they cannot exploit sufficient 3D geometry cues at inference time, when only monocular images are available. In this work, we present V2Depth, a novel coarse-to-fine framework with Virtual-View feature simulation for supervised monocular Depth estimation. Specifically, we first design a virtual-view feature simulator that leverages novel view synthesis and contrastive learning to generate virtual-view feature maps. In this way, we explicitly provide representative spatial geometry for subsequent depth estimation in both the training and inference stages. We then introduce a 3DVA-Refiner that iteratively optimizes the predicted depth map. During optimization, a 3D-aware virtual attention mechanism captures global spatial-context correlations, maintaining feature consistency across views and the estimation integrity of the 3D scene, e.g., for objects with occlusion relationships. Decisive improvements over state-of-the-art approaches on three benchmark datasets across all metrics demonstrate the superiority of our method.
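The authors' implementation is not reproduced on this page. Purely as an illustrative sketch, the snippet below shows two generic ingredients the abstract names: an InfoNCE-style contrastive loss, commonly used to align simulated virtual-view features with real-view features, and a simple iterative residual-refinement loop in the spirit of coarse-to-fine depth optimization. All function names, shapes, and the toy residual predictor are hypothetical assumptions, not the paper's actual architecture.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for a single anchor feature vector (hypothetical sketch).

    anchor, positive: (d,) feature vectors; negatives: iterable of (d,).
    The loss is low when `anchor` is far more similar to `positive`
    than to any negative, which is what contrastive feature alignment
    between views aims to achieve.
    """
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = np.array([cosine(anchor, positive)] +
                    [cosine(anchor, n) for n in negatives])
    logits = sims / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[0]))              # positive sits at index 0

def iterative_refine(depth_init, residual_fn, num_iters=3):
    """Coarse-to-fine refinement: add a predicted residual at each step."""
    depth = depth_init
    for _ in range(num_iters):
        depth = depth + residual_fn(depth)
    return depth

if __name__ == "__main__":
    a = np.array([1.0, 0.0])
    aligned = info_nce_loss(a, np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
    misaligned = info_nce_loss(a, np.array([0.0, 1.0]), [np.array([1.0, 0.0])])
    print(aligned < misaligned)   # True: matching views give a lower loss

    # Toy "refiner": each step predicts half of the remaining error
    # toward a ground-truth depth of 10.0 (purely illustrative).
    refined = iterative_refine(0.0, lambda d: (10.0 - d) * 0.5)
    print(refined)                # 8.75 after three residual steps
```

In the actual method the refinement operates on full depth maps with a learned attention-based residual predictor; the scalar loop above only illustrates the control flow of iterative residual updates.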


Cited By

  • SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), 3469--3478. DOI: 10.1145/3664647.3681405. Published 28 October 2024.
  • ColVO: Colonoscopic Visual Odometry Considering Geometric and Photometric Consistency. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), 8100--8109. DOI: 10.1145/3664647.3681286. Published 28 October 2024.
  • Domain Shared and Specific Prompt Learning for Incremental Monocular Depth Estimation. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), 8306--8315. DOI: 10.1145/3664647.3681155. Published 28 October 2024.


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. contrastive learning
    2. monocular depth estimation
    3. refinement network
    4. virtual-view feature simulation

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 -- November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

