Abstract
Multi-task dense scene understanding, which learns a single model for multiple dense prediction tasks, has a wide range of applications. Modeling long-range dependencies and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. The STM block handles long-range dependencies by leveraging Mamba, while the CTM block explicitly models task interactions to facilitate information exchange across tasks. Experiments on the NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.
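To make the two roles described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a fixed diagonal linear state-space recurrence as a simplified stand-in for Mamba's selective scan, and it assumes that cross-task exchange is done by gating each task's features against a shared feature built from all tasks. All function names, parameters, and the fusion rule are hypothetical; the released code at the URL above is the reference.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Diagonal linear state-space recurrence over a token sequence.

    x: (L, D) sequence; a, b, c: (D,) per-channel parameters (assumed fixed,
    unlike Mamba, where they depend on the input).
    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    """
    h = np.zeros(x.shape[1])
    y = np.zeros(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]   # state carries information from all earlier tokens
        y[t] = c * h
    return y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_task_block(feat, params):
    """Hypothetical self-task step: residual connection around a scan of one task's tokens."""
    return feat + ssm_scan(feat, *params)

def cross_task_block(feats, params, gate_weights):
    """Hypothetical cross-task step.

    feats: list of (L, D) arrays, one per task.
    A shared feature is built from the mean of all task features, then each
    task mixes its own feature with the shared one through a learned gate.
    """
    shared = ssm_scan(np.mean(np.stack(feats), axis=0), *params)
    fused = []
    for f, w in zip(feats, gate_weights):
        g = sigmoid(f @ w)                    # (L, D) gate with values in (0, 1)
        fused.append(g * f + (1.0 - g) * shared)
    return fused
```

With `a` set to ones the scan reduces to a running sum along the sequence, which shows how each output token can depend on every earlier token; the gate then decides, per position and channel, how much shared information each task takes in.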
Acknowledgements
This work is supported by Guangzhou-HKUST(GZ) Joint Funding Scheme (No. 2024A03J0241).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lin, B., Jiang, W., Chen, P., Zhang, Y., Liu, S., Chen, YC. (2025). MTMamba: Enhancing Multi-task Dense Scene Understanding by Mamba-Based Decoders. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_18
DOI: https://doi.org/10.1007/978-3-031-72897-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72896-9
Online ISBN: 978-3-031-72897-6
eBook Packages: Computer Science, Computer Science (R0)