ABSTRACT
Accurately measuring the absolute depth of every pixel captured by an imaging sensor is critical for real-time applications such as autonomous navigation, augmented reality, and robotics. To predict dense depth, a common approach is to fuse inputs from different sensor modalities such as LiDAR, cameras, and other time-of-flight sensors. LiDAR and other time-of-flight sensors provide accurate depth measurements, but the data they produce is sparse, both spatially and temporally. To fill in the missing depth information, RGB guidance is typically leveraged for its high-resolution detail. Because this approach relies on multiple sensor modalities, designing for robustness and adaptation is essential. In this work, we propose a transformer-like, self-attention-based generative adversarial network that estimates dense depth from RGB images and sparse depth data. We introduce a novel training recipe that makes the model robust, so that it works even when one of the input modalities is unavailable. The multi-head self-attention mechanism dynamically attends to the most salient parts of the RGB image or the corresponding sparse depth data, producing highly competitive results. Our proposed network also requires less memory for training and inference than existing convolutional neural networks built on heavy residual connections, making it better suited to resource-constrained edge applications. The source code is available at: https://github.com/kocchop/robust-multimodal-fusion-gan
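The fusion and robustness ideas in the abstract can be illustrated with a minimal sketch: tokens from an RGB branch and a sparse-depth branch are concatenated and passed through multi-head self-attention, and a training-time "modality dropout" randomly zeroes one branch so the model learns to cope with a missing input. The token counts, dimensions, head count, and 0.5 dropout probability below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, num_heads):
    """Scaled dot-product self-attention over a joint token sequence.

    tokens: (seq_len, d_model) -- concatenated RGB and sparse-depth tokens,
    so every head can attend across both modalities at once.
    """
    seq_len, d_model = tokens.shape
    d_head = d_model // num_heads
    # Project, then split the feature dimension into heads: (heads, seq, d_head).
    q = (tokens @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (tokens @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    out = attn @ v                                       # (heads, seq, d_head)
    # Merge heads back into a single feature dimension.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

d_model, num_heads = 16, 4
rgb_tokens = rng.normal(size=(8, d_model))    # 8 tokens from the RGB branch
depth_tokens = rng.normal(size=(8, d_model))  # 8 tokens from the sparse-depth branch

# Modality dropout (training-time robustness): randomly blank one branch so the
# attention heads learn to lean on whichever modality is actually present.
if rng.random() < 0.5:
    depth_tokens = np.zeros_like(depth_tokens)

w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
fused = multi_head_self_attention(
    np.concatenate([rgb_tokens, depth_tokens]), w_q, w_k, w_v, num_heads)
print(fused.shape)  # (16, 16): one fused feature vector per input token
```

Because both modalities share one attention map, a head that finds the depth tokens uninformative (e.g. zeroed out) simply shifts its attention weight onto the RGB tokens, which is the dynamic behavior the abstract describes.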
Index Terms: Robust Multimodal Depth Estimation using Transformer based Generative Adversarial Networks