MTMamba: Enhancing Multi-task Dense Scene Understanding by Mamba-Based Decoders

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15128)


Abstract

Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.
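
To make the abstract's two block types concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the SequenceMixer placeholder (a gated depthwise convolution) merely stands in for the selective state-space (Mamba) operation, and the STMBlock/CTMBlock names, the mean-pooled shared feature, and the concatenation-based fusion are hypothetical simplifications of the decoder design described in the paper.

```python
import torch
import torch.nn as nn

class SequenceMixer(nn.Module):
    """Placeholder for the selective state-space (Mamba) operation; a gated
    depthwise convolution is used here so the sketch runs with plain PyTorch."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, L, C) token sequence
        z = self.norm(x)
        a, gate = self.in_proj(z).chunk(2, dim=-1)
        a = self.conv(a.transpose(1, 2)).transpose(1, 2)
        return x + self.out_proj(a * torch.sigmoid(gate))   # residual update

class STMBlock(nn.Module):
    """Self-task block: each task's features are refined independently."""
    def __init__(self, dim):
        super().__init__()
        self.mixer = SequenceMixer(dim)

    def forward(self, feats):                  # feats: dict task -> (B, L, C)
        return {t: self.mixer(f) for t, f in feats.items()}

class CTMBlock(nn.Module):
    """Cross-task block: task features are fused with a shared representation
    so information is exchanged across tasks before task-specific mixing."""
    def __init__(self, dim, tasks):
        super().__init__()
        self.shared_mixer = SequenceMixer(dim)
        self.task_mixers = nn.ModuleDict({t: SequenceMixer(dim) for t in tasks})
        self.fuse = nn.ModuleDict({t: nn.Linear(2 * dim, dim) for t in tasks})

    def forward(self, feats):
        # Hypothetical cross-task exchange: mean-pool task features into a
        # shared sequence, then fuse it back into each task branch.
        shared = self.shared_mixer(torch.stack(list(feats.values())).mean(0))
        return {
            t: self.task_mixers[t](self.fuse[t](torch.cat([f, shared], dim=-1)))
            for t, f in feats.items()
        }

if __name__ == "__main__":
    tasks = ["semseg", "parsing", "boundary"]
    feats = {t: torch.randn(2, 196, 64) for t in tasks}   # (B, L=14x14, C)
    out = CTMBlock(64, tasks)(STMBlock(64)(feats))
    print({t: tuple(v.shape) for t, v in out.items()})    # (2, 196, 64) each
```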


Acknowledgements

This work is supported by Guangzhou-HKUST(GZ) Joint Funding Scheme (No. 2024A03J0241).

Author information

Corresponding authors

Correspondence to Shu Liu or Ying-Cong Chen.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lin, B., Jiang, W., Chen, P., Zhang, Y., Liu, S., Chen, YC. (2025). MTMamba: Enhancing Multi-task Dense Scene Understanding by Mamba-Based Decoders. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_18

  • DOI: https://doi.org/10.1007/978-3-031-72897-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72896-9

  • Online ISBN: 978-3-031-72897-6

  • eBook Packages: Computer Science, Computer Science (R0)
