DOI: 10.1145/3627631.3627634

MMAG: Mutually Motivated Attention Gates for Simultaneous Extraction of Contextual and Spatial Information from a Monocular Image

Published: 31 January 2024

Abstract

To interact effectively with its environment, an agent must understand both the 'what' and the 'where' of a scene. Two vision-based tasks, semantic segmentation and depth estimation, provide exactly this information. This paper introduces a unified model that performs both tasks from a shared latent space. The model uses an encoder-decoder architecture equipped with a novel attention gate mechanism called MMAG (Mutually Motivated Attention Gates). Standard skip connections, combined with the proposed attention gates, emphasize the features that matter most to both predictions. By exploiting the complementary information in the shared representations, the model produces more accurate predictions than other joint-prediction models while using fewer network parameters. Dilated layers are also incorporated so that attention can be focused over different receptive fields. Evaluated on the NYU Depth v2 and CamVid datasets, the model improves on other state-of-the-art models on both tasks. Its reduced parameter count makes it well suited to low-cost robots, and it has been successfully tested for inference on the LoCoBot robot.
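
The abstract does not spell out the MMAG formulation, so the sketch below is only an illustration of the general idea it describes: attention gates acting on shared skip connections, where the segmentation and depth branches of a joint encoder-decoder re-weight each other's features before decoding. The gate follows the additive attention-gate pattern popularized by Attention U-Net; all module names, channel sizes, and the specific cross-gating arrangement are assumptions made for illustration, not the paper's implementation.

```python
# Minimal PyTorch sketch (assumed structure, not the paper's code): an additive
# attention gate re-weights encoder skip features, and a joint decoder block lets
# the segmentation and depth branches gate each other's skip connections.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Additive attention gate: skip features x are re-weighted by a gating signal g."""

    def __init__(self, x_channels, g_channels, inter_channels):
        super().__init__()
        self.theta_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.phi_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, x, g):
        # Bring the gating signal to the skip feature's resolution, combine, and
        # squash to a per-pixel attention coefficient in (0, 1).
        g = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.theta_x(x) + self.phi_g(g))))
        return x * attn


class JointDecoderBlock(nn.Module):
    """One decoder stage where the two task branches gate the shared skip for each other."""

    def __init__(self, skip_ch, dec_ch):
        super().__init__()
        self.gate_seg = AttentionGate(skip_ch, dec_ch, dec_ch // 2)
        self.gate_depth = AttentionGate(skip_ch, dec_ch, dec_ch // 2)
        self.conv_seg = nn.Conv2d(skip_ch + dec_ch, dec_ch, kernel_size=3, padding=1)
        self.conv_depth = nn.Conv2d(skip_ch + dec_ch, dec_ch, kernel_size=3, padding=1)

    def forward(self, skip, seg_feat, depth_feat):
        seg_feat = F.interpolate(seg_feat, size=skip.shape[2:], mode="bilinear", align_corners=False)
        depth_feat = F.interpolate(depth_feat, size=skip.shape[2:], mode="bilinear", align_corners=False)
        # Assumed "mutual motivation": each branch's copy of the shared skip is gated
        # by the *other* branch's decoder features.
        seg_skip = self.gate_seg(skip, depth_feat)
        depth_skip = self.gate_depth(skip, seg_feat)
        seg_out = F.relu(self.conv_seg(torch.cat([seg_skip, seg_feat], dim=1)))
        depth_out = F.relu(self.conv_depth(torch.cat([depth_skip, depth_feat], dim=1)))
        return seg_out, depth_out
```

In a full model, one such block would sit at each decoder stage, with dilated convolutions in the shared encoder supplying the multi-receptive-field context the abstract mentions; the segmentation head would end in a per-class logit map and the depth head in a single-channel regression map.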

Published In

ICVGIP '23: Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing
December 2023, 352 pages
ISBN: 9798400716256
DOI: 10.1145/3627631

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. attention gate mechanism
2. encoder-decoder
3. monocular depth estimation
4. semantic segmentation

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICVGIP '23

Acceptance Rates

Overall Acceptance Rate: 95 of 286 submissions, 33%
