InfMAE: A Foundation Model in the Infrared Modality

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15076)

Included in the conference series: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In recent years, foundation models have swept the computer vision field, advancing a wide range of tasks across modalities. However, how to effectively design a foundation model for the infrared modality remains an open question. In this paper, we introduce InfMAE, a foundation model tailored specifically for the infrared modality. First, we present Inf30, an infrared dataset developed to mitigate the scarcity of large-scale data for self-supervised learning in the infrared vision community. Second, considering the intrinsic characteristics of infrared images, we design an information-aware masking strategy that places greater emphasis on information-rich regions of infrared images during self-supervised learning, which is conducive to learning strong representations. Third, to enhance generalization to downstream tasks, we employ a multi-scale encoder for latent representation learning. Finally, we develop an infrared decoder to reconstruct images. Extensive experiments show that InfMAE outperforms other supervised and self-supervised methods on three key downstream tasks: infrared image semantic segmentation, object detection, and small target detection.
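The information-aware masking strategy is the core algorithmic idea above, so a brief sketch may help make it concrete. The following Python snippet is a hypothetical illustration, not the paper's released code: it assumes per-patch pixel variance as the information score and uses an invented bias weight to mix that score with MAE-style uniform random masking; the function name information_aware_masking is likewise made up for this example.

import torch

def information_aware_masking(images, patch_size=16, mask_ratio=0.75, bias=0.5):
    # Hypothetical sketch: score each patch by its pixel variance (a simple
    # proxy for information content) and bias MAE-style random masking toward
    # information-rich regions. InfMAE's actual scoring function may differ.
    B, C, H, W = images.shape
    # Split images into non-overlapping patches: (B, C, n_patches, patch_pixels).
    patches = images.unfold(2, patch_size, patch_size)
    patches = patches.unfold(3, patch_size, patch_size)
    patches = patches.contiguous().view(B, C, -1, patch_size * patch_size)
    # Per-patch variance, averaged over channels and normalized per image.
    scores = patches.var(dim=-1).mean(dim=1)                   # (B, n_patches)
    scores = scores / (scores.sum(dim=1, keepdim=True) + 1e-6)
    n_patches = scores.shape[1]
    n_masked = int(mask_ratio * n_patches)
    # Mix uniform noise with the normalized scores so that information-rich
    # patches are more likely to be selected for masking.
    noise = (1.0 - bias) * torch.rand(B, n_patches, device=images.device) \
            + bias * n_patches * scores
    ids = noise.argsort(dim=1, descending=True)                # masked first
    mask = torch.zeros(B, n_patches, dtype=torch.bool, device=images.device)
    mask.scatter_(1, ids[:, :n_masked], True)                  # True = masked
    return mask

# Example: mask 75% of the patches of four single-channel 224x224 infrared
# images, favouring textured (high-variance) regions.
mask = information_aware_masking(torch.randn(4, 1, 224, 224))

Under this scheme, setting bias to 0 recovers the uniform random masking of a standard masked autoencoder, while larger values concentrate masking on the regions the score deems informative.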

Acknowledgements

This work is supported in part by the National Key R&D Program of China (2022YFA1004100), in part by the National Natural Science Foundation of China (Nos. 62176035, 62201111, 12226004, and 62272375), and in part by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202100606).

Author information

Corresponding author

Correspondence to Chenqiang Gao.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, F., Gao, C., Zhang, Y., Guo, J., Wang, J., Meng, D. (2025). InfMAE: A Foundation Model in the Infrared Modality. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_24

  • DOI: https://doi.org/10.1007/978-3-031-72649-1_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72648-4

  • Online ISBN: 978-3-031-72649-1

  • eBook Packages: Computer Science, Computer Science (R0)
