DOI: 10.1145/3627631.3627634

MMAG: Mutually Motivated Attention Gates for Simultaneous Extraction of Contextual and Spatial Information from a Monocular Image

Published: 31 January 2024

Abstract

To interact effectively with its environment, an agent must understand both the 'what' and the 'where' of a scene. Two vision-based tasks, semantic segmentation and depth estimation, provide exactly this information. This paper introduces a unified model that performs both tasks from a shared latent space. The model uses an encoder-decoder architecture equipped with a novel attention gate mechanism called MMAG (Mutually Motivated Attention Gates). Standard skip connections, combined with the proposed attention gates, emphasize the features that matter most to both predictions. By exploiting the complementary information in the shared representations, the model produces more accurate predictions than other joint-prediction models while using fewer network parameters. Dilated layers are also incorporated so that attention can be focused over different receptive fields. Evaluated on the NYU Depth v2 and CamVid datasets, the model improves on other state-of-the-art models on both tasks. Its reduced parameter count makes it well suited to low-cost robots, and it has been successfully tested for inference on the LoCoBot robot.
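
The abstract does not spell out the MMAG formulation, so the sketch below is only an illustration of the general idea it describes: attention gates acting on shared skip connections, where the segmentation and depth branches of a joint encoder-decoder re-weight each other's features before decoding. The gate follows the additive attention-gate pattern popularized by Attention U-Net; all module names, channel sizes, and the specific cross-gating arrangement are assumptions made for illustration, not the paper's implementation.

```python
# Minimal PyTorch sketch (assumed structure, not the paper's code): an additive
# attention gate re-weights encoder skip features, and a joint decoder block lets
# the segmentation and depth branches gate each other's skip connections.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Additive attention gate: skip features x are re-weighted by a gating signal g."""

    def __init__(self, x_channels, g_channels, inter_channels):
        super().__init__()
        self.theta_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.phi_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, x, g):
        # Bring the gating signal to the skip feature's resolution, combine, and
        # squash to a per-pixel attention coefficient in (0, 1).
        g = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.theta_x(x) + self.phi_g(g))))
        return x * attn


class JointDecoderBlock(nn.Module):
    """One decoder stage where the two task branches gate the shared skip for each other."""

    def __init__(self, skip_ch, dec_ch):
        super().__init__()
        self.gate_seg = AttentionGate(skip_ch, dec_ch, dec_ch // 2)
        self.gate_depth = AttentionGate(skip_ch, dec_ch, dec_ch // 2)
        self.conv_seg = nn.Conv2d(skip_ch + dec_ch, dec_ch, kernel_size=3, padding=1)
        self.conv_depth = nn.Conv2d(skip_ch + dec_ch, dec_ch, kernel_size=3, padding=1)

    def forward(self, skip, seg_feat, depth_feat):
        seg_feat = F.interpolate(seg_feat, size=skip.shape[2:], mode="bilinear", align_corners=False)
        depth_feat = F.interpolate(depth_feat, size=skip.shape[2:], mode="bilinear", align_corners=False)
        # Assumed "mutual motivation": each branch's copy of the shared skip is gated
        # by the *other* branch's decoder features.
        seg_skip = self.gate_seg(skip, depth_feat)
        depth_skip = self.gate_depth(skip, seg_feat)
        seg_out = F.relu(self.conv_seg(torch.cat([seg_skip, seg_feat], dim=1)))
        depth_out = F.relu(self.conv_depth(torch.cat([depth_skip, depth_feat], dim=1)))
        return seg_out, depth_out
```

In a full model, one such block would sit at each decoder stage, with dilated convolutions in the shared encoder supplying the multi-receptive-field context the abstract mentions; the segmentation head would end in a per-class logit map and the depth head in a single-channel regression map.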

Published In

ICVGIP '23: Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing
December 2023, 352 pages
ISBN: 9798400716256
DOI: 10.1145/3627631

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. attention gate mechanism
2. encoder-decoder
3. monocular depth estimation
4. semantic segmentation

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICVGIP '23

Acceptance Rates

Overall Acceptance Rate: 95 of 286 submissions, 33%
