MTMamba: Enhancing Multi-task Dense Scene Understanding by Mamba-Based Decoders

Lin, Baijiong; Jiang, Weisen; Chen, Pengguang; Zhang, Yu; Liu, Shu; Chen, Ying-Cong

doi:10.1007/978-3-031-72897-6_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15128)

Included in the following conference series: European Conference on Computer Vision

Abstract

Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.
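
The two ideas named in the abstract can be illustrated with a toy sketch. This is not the MTMamba implementation: `ssm_scan` shows the kind of linear state-space recurrence that Mamba-style models build on, using fixed scalar coefficients (Mamba itself makes the parameters input-dependent), and `cross_task_mix` is a purely hypothetical stand-in for cross-task information exchange. All function names, parameters, and the gating formula are assumptions made for illustration.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Toy scalar state-space recurrence (illustrative, not the STM block).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The hidden state h carries information from earlier positions forward,
    so each output can depend on distant inputs at cost linear in length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def cross_task_mix(task_feat: List[float],
                   shared_feat: List[float],
                   gate: float) -> List[float]:
    """Hypothetical fusion (an assumption, not the CTM block).

    Blends a task-specific feature with a feature shared across tasks
    using a convex combination controlled by `gate` in [0, 1].
    """
    return [gate * t + (1.0 - gate) * s for t, s in zip(task_feat, shared_feat)]


if __name__ == "__main__":
    # An impulse at position 0 decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
    # Equal weighting of task-specific and shared features.
    print(cross_task_mix([1.0, 2.0], [3.0, 4.0], gate=0.5))
```

In the impulse example, the first input still influences the last output, which is the long-range behaviour the abstract attributes to the Mamba-based blocks. The authors' actual block designs are in the repository linked above.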

Acknowledgements

This work is supported by the Guangzhou-HKUST(GZ) Joint Funding Scheme (No. 2024A03J0241).

Author information

Authors and Affiliations

The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Baijiong Lin & Ying-Cong Chen
The Hong Kong University of Science and Technology, Hong Kong, China
Weisen Jiang & Ying-Cong Chen
Southern University of Science and Technology, Shenzhen, China
Weisen Jiang & Yu Zhang
HKUST(GZ) - SmartMore Joint Lab, Guangzhou, China
Baijiong Lin & Ying-Cong Chen
SmartMore, Hong Kong, China
Pengguang Chen & Shu Liu

Corresponding authors

Correspondence to Shu Liu or Ying-Cong Chen.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Cite this paper

Lin, B., Jiang, W., Chen, P., Zhang, Y., Liu, S., Chen, YC. (2025). MTMamba: Enhancing Multi-task Dense Scene Understanding by Mamba-Based Decoders. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_18

DOI: https://doi.org/10.1007/978-3-031-72897-6_18
Published: 02 December 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72896-9
Online ISBN: 978-3-031-72897-6
eBook Packages: Computer Science, Computer Science (R0)
