Abstract
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework, offering remarkable performance across a wide range of downstream tasks. To increase the difficulty of the pretext task and learn richer visual representations, existing works have focused on replacing standard random masking with more sophisticated strategies, such as adversarial-guided and teacher-guided masking. However, these strategies depend on the input data, which commonly increases model complexity and requires additional computation to generate the mask patterns. This raises the question: can we enhance MAE performance beyond random masking without relying on input data or incurring additional computational costs? In this work, we introduce a simple yet effective data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise. Drawing inspiration from color noise in image processing, we explore four types of filters to yield mask patterns with different spatial and semantic priors. ColorMAE requires no additional learnable parameters or computational overhead in the network, yet it significantly enhances the learned representations. We provide a comprehensive empirical evaluation, demonstrating our strategy's superiority in downstream tasks compared to random masking. Notably, we report an improvement of 2.72 in mIoU on semantic segmentation tasks relative to baseline MAE implementations.
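As a rough illustration of the idea (not the paper's exact implementation), the sketch below generates a binary patch mask by filtering random noise with a Gaussian kernel and keeping the patches with the strongest responses at a fixed masking ratio. The function name `color_noise_mask`, the specific filter compositions, and all parameters (grid size, masking ratio, kernel size, sigma) are illustrative assumptions; the color names follow standard color-noise terminology.

```python
import torch
import torch.nn.functional as F

def color_noise_mask(grid_size=14, mask_ratio=0.75, mode="blue", kernel_size=7, sigma=1.5):
    """Sketch: build a binary patch mask by filtering random noise and keeping
    the top-k responses. `mode` selects a low-pass ('red'), high-pass ('blue'),
    band-pass-style ('green'), or band-stop-style ('purple') filter."""
    noise = torch.rand(1, 1, grid_size, grid_size)

    # Gaussian kernel used as the low-pass building block of all filters.
    coords = torch.arange(kernel_size) - kernel_size // 2
    g = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
    kernel = (g[:, None] * g[None, :])
    kernel = (kernel / kernel.sum()).view(1, 1, kernel_size, kernel_size)

    low = F.conv2d(noise, kernel, padding=kernel_size // 2)  # low frequencies
    high = noise - low                                        # high frequencies
    if mode == "red":
        filtered = low
    elif mode == "blue":
        filtered = high
    elif mode == "green":
        filtered = F.conv2d(high, kernel, padding=kernel_size // 2)  # mid band
    else:  # "purple": keep the extremes of the spectrum
        filtered = noise - F.conv2d(high, kernel, padding=kernel_size // 2)

    # Threshold: mask the patches with the highest filtered responses.
    num_mask = int(mask_ratio * grid_size * grid_size)
    idx = filtered.flatten().argsort(descending=True)[:num_mask]
    mask = torch.zeros(grid_size * grid_size, dtype=torch.bool)
    mask[idx] = True
    return mask.view(grid_size, grid_size)

# Example: a 75% blue-noise-style mask for a 14x14 ViT patch grid.
mask = color_noise_mask()
print(mask.float().mean())  # ~0.75
```

Because the mask depends only on pre-generated filtered noise and not on the input image, it can be sampled offline, which is what makes the strategy data-independent and free of extra network parameters.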
Acknowledgments
This work was supported by the KAUST Center of Excellence on GenAI under award number 5940.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hinojosa, C., Liu, S., Ghanem, B. (2025). ColorMAE: Exploring Data-Independent Masking Strategies in Masked AutoEncoders. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15078. Springer, Cham. https://doi.org/10.1007/978-3-031-72661-3_25
DOI: https://doi.org/10.1007/978-3-031-72661-3_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72660-6
Online ISBN: 978-3-031-72661-3