DOI: 10.1145/3503161.3548418
Research Article

Robust Multimodal Depth Estimation using Transformer based Generative Adversarial Networks

Published: 10 October 2022

ABSTRACT

Accurately measuring the absolute depth of every pixel captured by an imaging sensor is critically important in real-time applications such as autonomous navigation, augmented reality, and robotics. To predict dense depth, a common approach is to fuse inputs from sensors of different modalities, such as LiDAR, cameras, and other time-of-flight sensors. LiDAR and other time-of-flight sensors provide accurate depth data but are quite sparse, both spatially and temporally. To fill in the missing depth information, RGB guidance is typically leveraged because of its high resolution. Because of this reliance on multiple sensor modalities, designing for robustness and adaptation is essential. In this work, we propose a transformer-like, self-attention-based generative adversarial network that estimates dense depth from RGB and sparse depth data. We introduce a novel training recipe that makes the model robust, so that it works even when one of the input modalities is unavailable. The multi-head self-attention mechanism can dynamically attend to the most salient parts of the RGB image or the corresponding sparse depth data, producing highly competitive results. Our proposed network also requires less memory for training and inference than existing convolutional neural networks that rely heavily on residual connections, making it more suitable for resource-constrained edge applications. The source code is available at: https://github.com/kocchop/robust-multimodal-fusion-gan
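
To make the fusion idea concrete, the following is a minimal PyTorch sketch of multi-head self-attention over concatenated RGB and sparse-depth tokens. It is an illustration only: the module name `AttentionFusionBlock`, the embedding dimension, the number of heads, and the token-concatenation scheme are assumptions, not the authors' architecture.

```python
# Minimal sketch (assumed layout, not the paper's exact architecture):
# multi-head self-attention over concatenated RGB and sparse-depth tokens.
import torch
import torch.nn as nn

class AttentionFusionBlock(nn.Module):
    """Concatenates per-modality patch tokens and lets multi-head
    self-attention decide which modality to attend to at each position."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, rgb_tokens, depth_tokens):
        # rgb_tokens, depth_tokens: (B, N, dim) patch embeddings per modality
        x = torch.cat([rgb_tokens, depth_tokens], dim=1)  # (B, 2N, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                  # joint self-attention
        x = x + attn_out                                  # residual connection
        x = x + self.mlp(self.norm2(x))                   # transformer-style MLP
        return x                                          # fused tokens for a decoder

# Example: 16x16 patches of a 256x256 input give N = 256 tokens per modality
fusion = AttentionFusionBlock(dim=256, num_heads=8)
rgb_tokens = torch.randn(2, 256, 256)
depth_tokens = torch.randn(2, 256, 256)
fused = fusion(rgb_tokens, depth_tokens)  # shape (2, 512, 256)
```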
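
The robustness training recipe can likewise be pictured as a form of modality dropout. The sketch below randomly suppresses one input modality during a generator update so the network learns to produce dense depth even when the camera or the LiDAR input is unavailable; the drop probability, zero-filling, loss weighting, and the omission of the discriminator update are illustrative assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of modality-dropout training (generator update only);
# probabilities and loss weights are placeholders, not the paper's values.
import random
import torch
import torch.nn.functional as F

def drop_modality(rgb, sparse_depth, p_drop=0.3):
    """With probability p_drop, blank out exactly one of the two inputs."""
    r = random.random()
    if r < p_drop / 2:
        rgb = torch.zeros_like(rgb)                      # missing camera frame
    elif r < p_drop:
        sparse_depth = torch.zeros_like(sparse_depth)    # missing LiDAR scan
    return rgb, sparse_depth

def generator_step(generator, discriminator, g_opt,
                   rgb, sparse_depth, gt_depth, l1_weight=100.0):
    rgb, sparse_depth = drop_modality(rgb, sparse_depth)
    pred = generator(rgb, sparse_depth)                  # dense depth prediction
    adv_loss = -discriminator(pred, rgb).mean()          # simple adversarial term
    rec_loss = F.l1_loss(pred, gt_depth)                 # pixel-wise reconstruction
    loss = adv_loss + l1_weight * rec_loss
    g_opt.zero_grad()
    loss.backward()
    g_opt.step()
    return loss.item()
```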


Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Copyright © 2022 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
