Mamba-FETrack: Frame-Event Tracking via State Space Model

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15042)

Abstract

RGB-Event based tracking is an emerging research topic that focuses on how to effectively integrate heterogeneous multi-modal data (synchronized exposure video frames and asynchronous pulse Event streams). Existing works typically employ Transformer-based networks to handle these modalities and achieve decent accuracy on multiple datasets through input-level or feature-level fusion. However, these trackers incur significant memory consumption and computational complexity due to their use of the self-attention mechanism. This paper proposes Mamba-FETrack, a novel RGB-Event tracking framework based on the State Space Model (SSM), which achieves high-performance tracking while effectively reducing computational cost. Specifically, we adopt two modality-specific Mamba backbone networks to extract the features of RGB frames and Event streams, and further propose to boost the interactive learning between the RGB and Event features using a Mamba network. The fused features are fed into the tracking head for target object localization. Extensive experiments on the FELT, FE108, and FE240hz datasets fully validate the efficiency and effectiveness of our proposed tracker. Specifically, our Mamba-based tracker achieves 43.5/55.6 on the SR/PR metrics, while the ViT-S based tracker (OSTrack) obtains 40.0/50.9. The GPU memory cost of our tracker and the ViT-S based tracker is 13.98 GB and 15.44 GB, respectively, a reduction of about \(9.5\%\). The FLOPs and parameters of our tracker versus the ViT-S based OSTrack are 59G/1076G and 7M/60M, reductions of about \(94.5\%\) and \(88.3\%\), respectively. We hope this work can bring new insights to the tracking field and promote the application of the Mamba architecture in tracking. The source code of this work has been released at https://github.com/Event-AHU/Mamba_FETrack.
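The architecture described in the abstract (two modality-specific backbones, Mamba-based fusion, then a tracking head) can be summarized in a minimal sketch. The following is a hypothetical PyTorch skeleton, not the authors' released code: all class names, dimensions, and the box-regression head are assumptions, and a GRU stands in as a placeholder for the Mamba/SSM blocks so the example runs with plain PyTorch.

```python
# Hypothetical sketch of the Mamba-FETrack data flow described in the abstract.
# NOT the authors' implementation: the real work uses Vision Mamba (Vim) blocks;
# a GRU is used here only as a runnable stand-in for an SSM-style sequence model.
import torch
import torch.nn as nn


class PlaceholderSSMBlock(nn.Module):
    """Stand-in for a Mamba/SSM block: any linear-time sequence model fits here."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.seq = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, N, C)
        out, _ = self.seq(self.norm(tokens))
        return tokens + out  # residual connection


class MambaFETrackSketch(nn.Module):
    def __init__(self, dim: int = 192, patch: int = 16):
        super().__init__()
        # Modality-specific patch embeddings (RGB frame vs. event representation).
        self.rgb_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.evt_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Two modality-specific backbones and one fusion block, all SSM-style.
        self.rgb_backbone = PlaceholderSSMBlock(dim)
        self.evt_backbone = PlaceholderSSMBlock(dim)
        self.fusion = PlaceholderSSMBlock(dim)
        # Toy tracking head: predicts a box (cx, cy, w, h) from the fused tokens.
        self.head = nn.Linear(dim, 4)

    def forward(self, rgb: torch.Tensor, evt: torch.Tensor) -> torch.Tensor:
        rgb_tok = self.rgb_embed(rgb).flatten(2).transpose(1, 2)  # (B, N, C)
        evt_tok = self.evt_embed(evt).flatten(2).transpose(1, 2)  # (B, N, C)
        rgb_tok = self.rgb_backbone(rgb_tok)
        evt_tok = self.evt_backbone(evt_tok)
        # Interactive learning: concatenate token sequences and run the fusion block.
        fused = self.fusion(torch.cat([rgb_tok, evt_tok], dim=1))
        return self.head(fused.mean(dim=1))  # (B, 4) box estimate


if __name__ == "__main__":
    model = MambaFETrackSketch()
    rgb = torch.randn(2, 3, 128, 128)  # search-region frame crop
    evt = torch.randn(2, 3, 128, 128)  # event stream rendered as an image-like tensor
    print(model(rgb, evt).shape)       # torch.Size([2, 4])
```

The point of the SSM-based design, as the abstract argues, is that a linear-time sequence model replaces quadratic self-attention, which is where the reported FLOPs, parameter, and GPU-memory savings come from.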

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 62102205, 62102207, 62076004) and the Anhui Provincial Key Research and Development Program under Grant 2022i01020014. The authors acknowledge the High-performance Computing Platform of Anhui University for providing computing resources.

Author information

Corresponding author

Correspondence to Xiao Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2775 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Huang, J., Wang, S., Wang, S., Wu, Z., Wang, X., Jiang, B. (2025). Mamba-FETrack: Frame-Event Tracking via State Space Model. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15042. Springer, Singapore. https://doi.org/10.1007/978-981-97-8858-3_1

  • DOI: https://doi.org/10.1007/978-981-97-8858-3_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8857-6

  • Online ISBN: 978-981-97-8858-3

  • eBook Packages: Computer Science, Computer Science (R0)
