Abstract
In recent years, foundation models have swept the computer vision field, advancing a wide range of tasks across modalities. However, how to design an effective foundation model for the infrared modality remains an open question. In this paper, we introduce InfMAE, a foundation model tailored to the infrared modality. First, we present Inf30, an infrared dataset built to mitigate the scarcity of large-scale data for self-supervised learning in the infrared vision community. Second, considering the intrinsic characteristics of infrared images, we design an information-aware masking strategy that places greater emphasis on information-rich regions of infrared images during self-supervised learning, which helps the model learn strong representations. Third, to improve generalization to downstream tasks, we adopt a multi-scale encoder for latent representation learning. Finally, we develop an infrared decoder to reconstruct images. Extensive experiments show that InfMAE outperforms other supervised and self-supervised learning methods on three key downstream tasks: infrared image semantic segmentation, object detection, and small target detection.
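The information-aware masking idea lends itself to a short illustration. The abstract does not specify how information content is scored or how the bias is applied, so the sketch below is only one plausible reading, not the authors' implementation: it uses per-patch intensity variance as an information proxy and samples an MAE-style mask with probability proportional to that score, so information-rich patches are more likely to be hidden and reconstructed. The function name `information_aware_mask` and all implementation details are assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of an information-aware masking
# step for an MAE-style pipeline on single-channel infrared images.
# Assumption: per-patch intensity variance approximates "information",
# and higher-variance patches are masked with higher probability.
import torch

def information_aware_mask(imgs, patch_size=16, mask_ratio=0.75):
    """imgs: (B, 1, H, W) infrared batch; H, W divisible by patch_size.
    Returns a (B, N) boolean mask, N = (H // p) * (W // p);
    True marks patches hidden from the encoder."""
    B, _, H, W = imgs.shape
    p = patch_size
    # Flatten non-overlapping p x p patches: (B, N, p * p)
    patches = imgs.unfold(2, p, p).unfold(3, p, p).reshape(B, -1, p * p)
    score = patches.var(dim=-1)  # per-patch information proxy
    # Normalize scores into a sampling distribution (epsilon avoids 0/0)
    probs = (score + 1e-6) / (score + 1e-6).sum(dim=-1, keepdim=True)
    n_mask = int(probs.shape[1] * mask_ratio)
    # Sample without replacement, biased toward high-variance patches
    idx = torch.multinomial(probs, n_mask)
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

# e.g. information_aware_mask(torch.rand(8, 1, 224, 224)) -> (8, 196) mask
```

Sampling from a score-weighted distribution, rather than deterministically masking the top-scoring patches, keeps some randomness in the mask so that low-information background regions are still occasionally reconstructed.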
Acknowledgements
This work is supported in part by the National Key R&D Program of China (2022YFA1004100), in part by the National Natural Science Foundation of China (Nos. 62176035, 62201111, 12226004, 62272375), and in part by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202100606).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, F., Gao, C., Zhang, Y., Guo, J., Wang, J., Meng, D. (2025). InfMAE: A Foundation Model in the Infrared Modality. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_24
DOI: https://doi.org/10.1007/978-3-031-72649-1_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72648-4
Online ISBN: 978-3-031-72649-1