DOI: 10.1145/3664647.3681183

Task-Interaction-Free Multi-Task Learning with Efficient Hierarchical Feature Representation

Published: 28 October 2024

Abstract

Traditional multi-task learning often relies on explicit task-interaction mechanisms to improve multi-task performance. However, these approaches suffer from negative transfer when jointly learning multiple weakly correlated tasks. They also process encoded features at large scales, which increases computational complexity in order to sustain dense-prediction performance. In this study, we introduce a Task-Interaction-Free Network (TIF) for multi-task learning that diverges from explicitly designed task-interaction mechanisms. First, we present a Scale Attentive-Feature Fusion Module (SAFF) that enriches each scale of the shared encoder with task-agnostic encoded features. Our task- and scale-specific decoders then efficiently decode the enhanced features shared across tasks without requiring task-interaction modules. Concretely, we use a Self-Feature Distillation Module (SFD) to explore task-specific features at lower scales and a Low-To-High Scale Feature Diffusion Module (LTHD) to diffuse global pixel relationships from low-level to high-level scales. Experiments on publicly available multi-task learning datasets validate that TIF attains state-of-the-art performance.
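To make the low-to-high diffusion idea concrete, here is a minimal NumPy sketch under the assumption that "diffusing global pixel relationships from low-level to high-level scales" amounts to upsampling a coarse (globally-informed) feature map and gating it into the next finer scale. The function names, shapes, and the fixed `gate_weight` are illustrative stand-ins, not the paper's actual LTHD implementation (which would use learned parameters and attention):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def diffuse_low_to_high(low, high, gate_weight=0.5):
    """Fuse a low-scale feature map into the next higher scale.

    low:  (C, H, W)    coarse features carrying global pixel relations
    high: (C, 2H, 2W)  finer features from the shared encoder
    gate_weight is a stand-in for a learned per-channel gate.
    """
    up = upsample2x(low)
    # Convex combination of the diffused global context and local detail.
    return gate_weight * up + (1.0 - gate_weight) * high

# Toy two-scale pyramid with 8 channels.
low = np.random.rand(8, 4, 4)
high = np.random.rand(8, 8, 8)
fused = diffuse_low_to_high(low, high)
print(fused.shape)  # (8, 8, 8)
```

In a real network the gate would be learned and the fusion repeated scale by scale, so global context accumulated at the coarsest level propagates to the full-resolution prediction without any cross-task exchange.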


Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. dense prediction
  2. multi-scale features
  3. multi-task learning
  4. vision transformer

Qualifiers

  • Research-article

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

