ABSTRACT
Accurately measuring the absolute depth of every pixel captured by an imaging sensor is critical for real-time applications such as autonomous navigation, augmented reality, and robotics. To predict dense depth, a common approach is to fuse inputs from different sensor modalities such as LiDAR, cameras, and other time-of-flight sensors. LiDAR and other time-of-flight sensors provide accurate depth measurements, but the data they produce is sparse, both spatially and temporally. To fill in the missing depth information, RGB guidance is typically leveraged for its high-resolution detail. Because this approach relies on multiple sensor modalities, designing for robustness and adaptation is essential. In this work, we propose a transformer-like, self-attention-based generative adversarial network that estimates dense depth from RGB images and sparse depth data. We introduce a novel training recipe that makes the model robust, so that it works even when one of the input modalities is unavailable. The multi-head self-attention mechanism dynamically attends to the most salient parts of the RGB image or the corresponding sparse depth data, producing highly competitive results. Our proposed network also requires less memory for training and inference than existing convolutional neural networks built on heavy residual connections, making it better suited to resource-constrained edge applications. The source code is available at: https://github.com/kocchop/robust-multimodal-fusion-gan
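The fusion and robustness ideas in the abstract can be illustrated with a minimal sketch: tokens from an RGB branch and a sparse-depth branch are concatenated and passed through multi-head self-attention, and a training-time "modality dropout" randomly zeroes one branch so the model learns to cope with a missing input. The token counts, dimensions, head count, and 0.5 dropout probability below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, num_heads):
    """Scaled dot-product self-attention over a joint token sequence.

    tokens: (seq_len, d_model) -- concatenated RGB and sparse-depth tokens,
    so every head can attend across both modalities at once.
    """
    seq_len, d_model = tokens.shape
    d_head = d_model // num_heads
    # Project, then split the feature dimension into heads: (heads, seq, d_head).
    q = (tokens @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (tokens @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    out = attn @ v                                       # (heads, seq, d_head)
    # Merge heads back into a single feature dimension.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

d_model, num_heads = 16, 4
rgb_tokens = rng.normal(size=(8, d_model))    # 8 tokens from the RGB branch
depth_tokens = rng.normal(size=(8, d_model))  # 8 tokens from the sparse-depth branch

# Modality dropout (training-time robustness): randomly blank one branch so the
# attention heads learn to lean on whichever modality is actually present.
if rng.random() < 0.5:
    depth_tokens = np.zeros_like(depth_tokens)

w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
fused = multi_head_self_attention(
    np.concatenate([rgb_tokens, depth_tokens]), w_q, w_k, w_v, num_heads)
print(fused.shape)  # (16, 16): one fused feature vector per input token
```

Because both modalities share one attention map, a head that finds the depth tokens uninformative (e.g. zeroed out) simply shifts its attention weight onto the RGB tokens, which is the dynamic behavior the abstract describes.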
Index Terms: Robust Multimodal Depth Estimation using Transformer based Generative Adversarial Networks