Skip to main content

ColorMAE: Exploring Data-Independent Masking Strategies in Masked AutoEncoders

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework, offering remarkable performance across a wide range of downstream tasks. To increase the difficulty of the pretext task and learn richer visual representations, existing works have focused on replacing standard random masking with more sophisticated strategies, such as adversarial-guided and teacher-guided masking. However, these strategies depend on the input data thus commonly increasing the model complexity and requiring additional calculations to generate the mask patterns. This raises the question: Can we enhance MAE performance beyond random masking without relying on input data or incurring additional computational costs? In this work, we introduce a simple yet effective data-independent method, termed , which generates different binary mask patterns by filtering random noise. Drawing inspiration from color noise in image processing, we explore four types of filters to yield mask patterns with different spatial and semantic priors. requires no additional learnable parameters or computational overhead in the network, yet it significantly enhances the learned representations. We provide a comprehensive empirical evaluation, demonstrating our strategy’s superiority in downstream tasks compared to random masking. Notably, we report an improvement of 2.72 in mIoU in semantic segmentation tasks relative to baseline MAE implementations.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Ahmed, A.G., Wonka, P.: Screen-space blue-noise diffusion of Monte Carlo sampling error via hierarchical ordering of pixels. ACM Trans. Graph. (TOG) 39(6), 1–15 (2020)

    Article  Google Scholar 

  2. Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

    Google Scholar 

  3. Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: data2vec: a general framework for self-supervised learning in speech, vision and language. In: International Conference on Machine Learning, pp. 1298–1312. PMLR (2022)

    Google Scholar 

  4. Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)

  5. Brown, T., et al.: Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020)

    Google Scholar 

  6. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)

    Google Scholar 

  7. Castleman, K.R.: Digital Image Processing. Prentice Hall Press (1996)

    Google Scholar 

  8. Chen, K., Liu, Z., Hong, L., Xu, H., Li, Z., Yeung, D.Y.: Mixed autoencoder for self-supervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22742–22751 (2023)

    Google Scholar 

  9. Chen, P., Liu, S., Zhao, H., Jia, J.: GridMask data augmentation. arXiv preprint arXiv:2001.04086 (2020)

  10. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)

    Google Scholar 

  11. Chen, X., et al.: Context autoencoder for self-supervised representation learning. Int. J. Comput. Vis. 132(1), 208–223 (2024)

    Article  Google Scholar 

  12. Chen, X., He, K.: Exploring simple Siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)

    Google Scholar 

  13. Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2021, pp. 9640–9649 (2021)

    Google Scholar 

  14. Correa, C.V., Arguello, H., Arce, G.R.: Spatiotemporal blue noise coded aperture design for multi-shot compressive spectral imaging. JOSA A 33(12), 2312–2322 (2016)

    Article  Google Scholar 

  15. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  16. Dong, X., et al.: Bootstrapped Masked Autoencoders for Vision BERT Pretraining. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022, ECCV 2022. LNCS, vol. 13690, pp. 247–264. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20056-4_15

  17. Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021). https://openreview.net/forum?id=YicbFdNTTy

  18. El-Nouby, A., Izacard, G., Touvron, H., Laptev, I., Jegou, H., Grave, E.: Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740 (2021)

  19. Ermolov, A., Siarohin, A., Sangineto, E., Sebe, N.: Whitening for self-supervised representation learning. In: International Conference on Machine Learning, pp. 3015–3024. PMLR (2021)

    Google Scholar 

  20. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111, 98–136 (2015)

    Article  Google Scholar 

  21. Feng, Z., Zhang, S.: Evolved part masking for self-supervised learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10386–10395 (2023)

    Google Scholar 

  22. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice Hall (2008)

    Google Scholar 

  23. Grill, J.B., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural. Inf. Process. Syst. 33, 21271–21284 (2020)

    Google Scholar 

  24. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)

    Google Scholar 

  25. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)

    Google Scholar 

  26. Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)

  27. Kakogeorgiou, I., et al.: What to hide from your students: attention-guided masked image modeling. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision, ECCV 2022. LNCS, vol. 13690, pp. 300–318. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20056-4_18

  28. Lau, D.L., Arce, G.R., Gallagher, N.C.: Green-noise digital halftoning. Proc. IEEE 86(12), 2424–2444 (1998)

    Article  Google Scholar 

  29. Lau, D.L., Ulichney, R., Arce, G.R.: Blue and green noise halftoning models. IEEE Sig. Process. Mag. 20(4), 28–38 (2003)

    Article  Google Scholar 

  30. Li, G., Zheng, H., Liu, D., Wang, C., Su, B., Zheng, C.: SemMAE: semantic-guided masking for learning masked autoencoders. Adv. Neural. Inf. Process. Syst. 35, 14290–14302 (2022)

    Google Scholar 

  31. Li, X., Wang, W., Yang, L., Yang, J.: Uniform masking: enabling MAE pre-training for pyramid-based vision transformers with locality. arXiv preprint arXiv:2205.10063 (2022)

  32. Li, Y., Mao, H., Girshick, R., He, K.: Exploring plain vision transformer backbones for object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision, ECCV 2022. LNCS, vol. 13669, pp. 280–296. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20077-9_17

  33. Li, Z., et al.: MST: masked self-supervised transformer for visual representation. Adv. Neural. Inf. Process. Syst. 34, 13165–13176 (2021)

    Google Scholar 

  34. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V 13. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

    Chapter  Google Scholar 

  35. Muhammad, M.B., Yeasin, M.: Eigen-CAM: class activation map using principal components. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2020)

    Google Scholar 

  36. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  37. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)

    Article  MathSciNet  Google Scholar 

  38. Shi, Y., Siddharth, N., Torr, P., Kosiorek, A.R.: Adversarial masking for self-supervised learning. In: International Conference on Machine Learning, pp. 20026–20040. PMLR (2022)

    Google Scholar 

  39. Ulichney, R.A.: Void-and-cluster method for dither array generation. In: Human Vision, Visual Processing, and Digital Display IV, vol. 1913, pp. 332–343. SPIE (1993)

    Google Scholar 

  40. Vasseur, D.A., Yodzis, P.: The color of environmental noise. Ecology 85(4), 1146–1152 (2004)

    Article  Google Scholar 

  41. Wang, H., Fan, J., Wang, Y., Song, K., Wang, T., Zhang, Z.X.: DropPos: pre-training vision transformers by reconstructing dropped positions. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

    Google Scholar 

  42. Wang, H., Song, K., Fan, J., Wang, Y., Xie, J., Zhang, Z.: Hard patches mining for masked image modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10375–10385 (2023)

    Google Scholar 

  43. Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14668–14678 (2022)

    Google Scholar 

  44. Wolfe, A., Morrical, N., Akenine-Möller, T., Ramamoorthi, R., Ghosh, A., Wei, L.: Spatiotemporal blue noise masks. In: Eurographics Symposium on Rendering, pp. 117–126 (2022)

    Google Scholar 

  45. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742 (2018)

    Google Scholar 

  46. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 432–448. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_26

    Chapter  Google Scholar 

  47. Xie, J., et al.: Masked frequency modeling for self-supervised visual pre-training. In: The Eleventh International Conference on Learning Representations (2022)

    Google Scholar 

  48. Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9653–9663 (2022)

    Google Scholar 

  49. Zhang, Q., Wang, Y., Wang, Y.: How mask matters: towards theoretical understandings of masked autoencoders. Adv. Neural. Inf. Process. Syst. 35, 27127–27139 (2022)

    Google Scholar 

  50. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)

    Google Scholar 

  51. Zhou, J., et al.: iBOT: image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)

Download references

Acknowledgments

This work was supported by the KAUST Center of Excellence on GenAI under award number 5940.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Carlos Hinojosa .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 21105 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Hinojosa, C., Liu, S., Ghanem, B. (2025). ColorMAE: Exploring Data-Independent Masking Strategies in Masked AutoEncoders. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_25

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72661-3_25

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72660-6

  • Online ISBN: 978-3-031-72661-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics