research-article

Lifting the Veil of Frequency in Joint Segmentation and Depth Estimation

Authors:

Errui Ding

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 944 - 952

https://doi.org/10.1145/3474085.3475277

Published: 17 October 2021

Abstract

Joint learning of scene parsing and depth estimation remains challenging because the two tasks compete with each other. In this paper, we revisit mutual enhancement for joint semantic segmentation and depth estimation. Inspired by the observation that both the competition and the cooperation between the tasks are reflected in the frequency components of their features, we propose a Frequency Aware Feature Enhancement (FAFE) network that strengthens the reciprocal relationship while avoiding the competition. In FAFE, a frequency disentanglement module selects the favorable set of frequency components for each task and resolves the discordance between the two tasks. For task cooperation, we introduce a re-calibration unit that aggregates the features of the two tasks so that each task's information complements the other's. Accordingly, the learning of each task is boosted by its complementary task. In addition, a novel local-aware consistency loss is imposed on the predicted segmentation and depth to strengthen the cooperation. With the FAFE network and the local-aware consistency loss integrated into a multi-task learning network, the proposed approach achieves superior performance over previous state-of-the-art methods. Extensive experiments and ablation studies on multi-task datasets demonstrate the effectiveness of the proposed approach.
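
The abstract does not specify how the frequency disentanglement module is implemented. As a rough illustration of what separating a feature map into frequency components can look like, the sketch below splits a single 2-D feature map into a low-frequency and a high-frequency part with a radial mask in the Fourier domain. The function name `split_frequencies`, the `cutoff` parameter, and the use of a fixed FFT mask are all assumptions made for this example; they are not taken from the paper, and the FAFE module may work quite differently.

```python
import numpy as np


def split_frequencies(feature, cutoff=0.25):
    """Split a 2-D feature map into low- and high-frequency parts.

    Hypothetical helper for illustration only (not the FAFE module).
    `cutoff` is an assumed radius in normalized frequency
    (cycles per sample; the Nyquist limit is 0.5).
    """
    h, w = feature.shape
    # Normalized frequency coordinate of every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Boolean mask that keeps bins inside the cutoff radius.
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff

    spectrum = np.fft.fft2(feature)
    # The mask is symmetric, so the imaginary part after the inverse
    # transform is only numerical noise and can be discarded.
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature = rng.standard_normal((8, 8))
    low, high = split_frequencies(feature)
    # The two parts are complementary: together they rebuild the input.
    print(np.allclose(low + high, feature))
```

Because the two masks are complementary, the low and high parts always sum back to the input, so a split of this kind loses no information. How the components would then be assigned to segmentation or depth is the subject of the paper and is not modeled here.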

Supplementary Material

Presentation video: mm21_video.mp4 (MP4 file, 45.44 MB)

Cited By

Liao M, Tian S, Zhang Y, Hua G, You R, Zou W, Li X (2025). Class-discriminative domain generalization for semantic segmentation. Image and Vision Computing 154 (105393). Online publication date: Feb-2025.
https://doi.org/10.1016/j.imavis.2024.105393
Cong R, Wu C, Song X, Zhang W, Kwong S, Li H, Ji P (2024). SRNSD: Structure-Regularized Night-Time Self-Supervised Monocular Depth Estimation for Outdoor Scenes. IEEE Transactions on Image Processing 33 (5538-5550). Online publication date: 2024.
https://doi.org/10.1109/TIP.2024.3465034

Index Terms

Lifting the Veil of Frequency in Joint Segmentation and Depth Estimation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, Facebook, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
