
Disentangling Masked Autoencoders for Unsupervised Domain Generalization

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Domain Generalization (DG), designed to enhance out-of-distribution (OOD) generalization, centers on learning invariance against domain shifts with sufficient supervision signals. Yet the scarcity of such labeled data has led to the rise of unsupervised domain generalization (UDG), an important yet more challenging task in which models are trained across diverse domains in an unsupervised manner and eventually tested on unseen domains. UDG is fast gaining attention but remains far from well-studied.

To close this research gap, we propose a novel learning framework designed for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), which aims to discover disentangled representations that faithfully reveal intrinsic features and superficial variations without access to class labels. At its core is the distillation of domain-invariant semantic features, which cannot be distinguished by a domain classifier, while filtering out domain-specific variations (for example, color schemes and texture patterns) that are unstable and redundant. Notably, DisMAE co-trains an asymmetric dual-branch architecture with semantic and lightweight variation encoders, offering dynamic data manipulation and representation-level augmentation capabilities. Extensive experiments on four benchmark datasets (i.e., DomainNet, PACS, VLCS, and Colored MNIST) with both DG and UDG tasks demonstrate that DisMAE achieves competitive OOD performance compared with state-of-the-art DG and UDG baselines, shedding light on a potential research line for improving generalization with large-scale unlabeled data. Our code is available at https://github.com/rookiehb/DisMAE.

A. Zhang and H. Wang contributed equally to this work.


References

1. Ahuja, K., et al.: Invariance principle meets information bottleneck for out-of-distribution generalization. In: NeurIPS (2021)
2. Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. CoRR abs/1907.02893 (2019)
3. Bai, H., et al.: DecAug: out-of-distribution generalization via decomposed feature representation and semantic augmentation. In: AAAI (2021)
4. Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. In: ICLR (2022)
5. Bardes, A., Ponce, J., LeCun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. In: ICLR (2022)
6. Bui, M., Tran, T., Tran, A., Phung, D.Q.: Exploiting domain-specific features to enhance domain generalization. In: NeurIPS (2021)
7. Cabannes, V., Kiani, B.T., Balestriero, R., LeCun, Y., Bietti, A.: The SSL interplay: augmentations, inductive bias, and generalization. In: ICML (2023)
8. Cai, R., Li, Z., Wei, P., Qiao, J., Zhang, K., Hao, Z.: Learning disentangled semantic representation for domain adaptation. In: IJCAI (2019)
9. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: NeurIPS (2020)
10. Chen, L., Zhang, Y., Song, Y., van den Hengel, A., Liu, L.: Domain generalization via rationale invariance. In: ICCV, pp. 1751–1760 (2023)
11. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I.: Generative pretraining from pixels. In: ICML (2020)
12. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
13. Chen, X., et al.: Context autoencoder for self-supervised representation learning. CoRR abs/2202.03026 (2022)
14. Chen, X., Fan, H., Girshick, R.B., He, K.: Improved baselines with momentum contrastive learning. CoRR abs/2003.04297 (2020)
15. Chen, X., He, K.: Exploring simple Siamese representation learning. In: CVPR (2021)
16. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
17. Du, Y., et al.: AdaRNN: adaptive learning and forecasting of time series. In: CIKM (2021)
18. Fang, C., Xu, Y., Rockmore, D.N.: Unbiased metric learning: on the utilization of multiple datasets and web images for softening bias. In: ICCV, pp. 1657–1664 (2013)
19. Gholami, B., El-Khamy, M., Song, K.: Latent feature disentanglement for visual domain generalization. IEEE Trans. Image Process. 32, 5751–5763 (2023)
20. Grill, J., et al.: Bootstrap your own latent: a new approach to self-supervised learning. In: NeurIPS (2020)
21. Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: ICLR (2021)
22. Harary, S., et al.: Unsupervised domain generalization by learning a bridge across domains. In: CVPR (2022)
23. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.B.: Masked autoencoders are scalable vision learners. In: CVPR (2022)
24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
25. Huang, J., Guan, D., Xiao, A., Lu, S.: FSDR: frequency space domain randomization for domain generalization. In: CVPR, pp. 6891–6902 (2021)
26. Jung, Y., Tian, J., Bareinboim, E.: Learning causal effects via weighted empirical risk minimization. In: NeurIPS (2020)
27. Krueger, D., et al.: Out-of-distribution generalization via risk extrapolation (REx). In: ICML (2021)
28. Li, D., Yang, Y., Song, Y., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: ICCV, pp. 5543–5551 (2017)
29. Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: CVPR (2018)
30. Li, P., Li, D., Li, W., Gong, S., Fu, Y., Hospedales, T.M.: A simple feature augmentation for domain generalization. In: ICCV (2021)
31. Li, Y., et al.: Deep domain generalization via conditional invariant adversarial networks. In: ECCV (2018)
32. Lin, C., Yuan, Z., Zhao, S., Sun, P., Wang, C., Cai, J.: Domain-invariant disentangled network for generalizable object detection. In: ICCV, pp. 8751–8760 (2021)
33. Liu, Y., et al.: Promoting semantic connectivity: dual nearest neighbors contrastive learning for unsupervised domain generalization. In: CVPR (2023)
34. Lu, W., Wang, J., Li, H., Chen, Y., Xie, X.: Domain-invariant feature exploration for domain generalization. Trans. Mach. Learn. Res. (2022)
35. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018)
36. Pearl, J., Glymour, M., Jewell, N.P.: Causal Inference in Statistics: A Primer. Wiley, Hoboken (2016)
37. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: ICCV (2019)
38. Peng, X., Huang, Z., Sun, X., Saenko, K.: Domain agnostic learning with disentangled representations. In: ICML, pp. 5102–5112 (2019)
39. Ramé, A., Dancette, C., Cord, M.: Fishr: invariant gradient variances for out-of-distribution generalization. In: ICML (2022)
40. Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. CoRR abs/1911.08731 (2019)
41. Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., Sarawagi, S.: Generalizing across domains via cross-gradient training. In: ICLR (2018)
42. Shao, R., Lan, X., Li, J., Yuen, P.C.: Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In: CVPR (2019)
43. Shu, Y., Cao, Z., Wang, C., Wang, J., Long, M.: Open domain generalization with domain-augmented meta-learning. In: CVPR, pp. 9624–9633 (2021)
44. Vapnik, V.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
45. Wald, Y., Feder, A., Greenfeld, D., Shalit, U.: On calibration and out-of-domain generalization. In: NeurIPS (2021)
46. Wang, J., et al.: Generalizing to unseen domains: a survey on domain generalization. IEEE Trans. Knowl. Data Eng. 35(8), 8052–8072 (2023)
47. Wang, T., Sun, Q., Pranata, S., Karlekar, J., Zhang, H.: Equivariance and invariance inductive bias for learning from insufficient data. In: ECCV (2022)
48. Wang, T., Yue, Z., Huang, J., Sun, Q., Zhang, H.: Self-supervised learning disentangled group representation as feature. In: NeurIPS (2021)
49. Wang, Y., Li, H., Cheng, H., Wen, B., Chau, L., Kot, A.C.: Variational disentanglement for domain generalization. Trans. Mach. Learn. Res. (2022)
50. Wen, Z., Li, Y.: Toward understanding the feature learning process of self-supervised contrastive learning. In: ICML (2021)
51. Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. In: CVPR (2022)
52. Yan, S., Song, H., Li, N., Zou, L., Ren, L.: Improve unsupervised domain adaptation with mixup training. CoRR abs/2001.00677 (2020)
53. Yang, H., et al.: Cycle-consistent masked autoencoder for unsupervised domain generalization. In: ICLR (2023)
54. Yang, H., et al.: Domain invariant masked autoencoders for self-supervised learning from multi-domains. In: ECCV 2022, LNCS, vol. 13691, pp. 151–168. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19821-2_9
55. Ye, H., Xie, C., Cai, T., Li, R., Li, Z., Wang, L.: Towards a theoretical framework of out-of-distribution generalization. In: NeurIPS (2021)
56. Ye, N., et al.: OoD-Bench: quantifying and understanding two dimensions of out-of-distribution generalization. In: CVPR (2022)
57. Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: ICML (2021)
58. Zhang, H., Zhang, Y., Liu, W., Weller, A., Schölkopf, B., Xing, E.P.: Towards principled disentanglement for domain generalization. In: CVPR (2022)
59. Zhang, X., He, Y., Xu, R., Yu, H., Shen, Z., Cui, P.: NICO++: towards better benchmarking for domain generalization. CoRR abs/2204.08040 (2022)
60. Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., Liu, H.: Towards unsupervised domain generalization. In: CVPR, pp. 4900–4910 (2022)
61. Zhao, S., Gong, M., Liu, T., Fu, H., Tao, D.: Domain generalization via entropy regularization. In: NeurIPS (2020)


Acknowledgements

This research is supported by the National Natural Science Foundation of China (92270114) and by the advanced computing resources provided by the Supercomputing Center of USTC.

Author information

Correspondence to An Zhang.

Appendices

A Algorithm

Algorithm 1 depicts the detailed procedure of DisMAE; a simplified sketch in code is given below.

Algorithm 1. The overall training procedure of DisMAE.
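As a complement to Algorithm 1, here is a minimal PyTorch-style sketch of one training epoch, assembled from the details in Appendices B and C.7. All module and helper names (semantic_enc, variation_enc, decoder, domain_clf, random_mask, adversarial_loss, adaptive_contrastive_loss) are hypothetical stand-ins rather than the authors' exact implementation; the released code is authoritative.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: semantic_enc, variation_enc, decoder, domain_clf
# are the DisMAE modules; random_mask, adversarial_loss, and
# adaptive_contrastive_loss abbreviate components described in the paper.

def train_one_epoch(loader, semantic_enc, variation_enc, decoder, domain_clf,
                    optimizer, clf_optimizer, epoch,
                    T_ad=15, E_ad=100, lambda1=1e-3, lambda2=1.0):
    for images, domains in loader:                        # no class labels (UDG)
        visible, mask = random_mask(images, ratio=0.8)    # MAE-style masking
        s = semantic_enc(visible)                         # domain-invariant branch
        v = variation_enc(visible)                        # lightweight variation branch
        recon = decoder(s, v)

        loss_rec = F.mse_loss(recon * mask, images * mask)   # masked reconstruction
        loss_adv = adversarial_loss(domain_clf(s), domains)  # s fools the classifier
        loss_con = adaptive_contrastive_loss(s, v, domains)  # see Appendix B
        loss = loss_rec + lambda1 * loss_adv + lambda2 * loss_con

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Periodically refresh the domain classifier with the backbones frozen.
    if epoch % T_ad == 0 and epoch <= E_ad:
        for images, domains in loader:
            with torch.no_grad():
                s = semantic_enc(random_mask(images, ratio=0.8)[0])
            clf_loss = F.cross_entropy(domain_clf(s), domains)
            clf_optimizer.zero_grad()
            clf_loss.backward()
            clf_optimizer.step()
```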

B Discussion About Differences

We argue that DisMAE is novel and differs significantly from prior studies in three aspects. 1) Scope. Transitioning to UDG is non-trivial. Previous disentangled methods such as DADA [38], DIDN [32], and DIR [19], while effective in DG, struggle with unsupervised data due to their heavy dependence on class labels to encapsulate semantic attributes. 2) Disentangled Targets. Without class-label guidance, achieving a domain-invariant semantic encoder is challenging. Many UDG methods, such as DiMAE [54] and CycleMAE [53], can only separate domain styles using multiple decoders but fall short of disentangling domain-invariant semantics from variations. 3) Disentanglement Strategy. DisMAE is grounded in disentanglement and invariance principles, uniquely combining an adaptive contrastive loss with a reconstruction loss. The adaptive contrastive loss, in particular, leverages the domain classifier and intra-domain negative sampling (see the sketch below). The differences are summarized in Table 4.
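To illustrate the intra-domain negative sampling idea, here is a minimal sketch assuming an InfoNCE-style objective with temperature \(\tau = 0.4\); the classifier-based adaptive weighting used in the actual loss is omitted for brevity, so this is not the exact DisMAE objective.

```python
import torch
import torch.nn.functional as F

def intra_domain_infonce(z1, z2, domains, tau=0.4):
    """InfoNCE-style loss whose negatives are restricted to samples from
    the SAME domain, so the contrast cannot be solved by domain style
    alone. z1, z2: (N, D) embeddings of two views; domains: (N,) ids."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                       # (N, N) cosine similarities
    same_domain = domains.unsqueeze(0) == domains.unsqueeze(1)
    logits = logits.masked_fill(~same_domain, float("-inf"))  # drop cross-domain pairs
    targets = torch.arange(len(z1), device=z1.device)
    return F.cross_entropy(logits, targets)          # positives on the diagonal

# e.g. loss = intra_domain_infonce(torch.randn(8, 64), torch.randn(8, 64),
#                                  torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]))
```

Restricting negatives to the same domain forces the semantic branch to discriminate instances by content rather than by easily separable domain cues.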

Table 4. Comparison with previous works

C Experiments

1.1 C.1 Experimental Settings

Baseline Hyperparameter Tuning. For a fair comparison, we uniformly substitute the backbones of all baselines with the same ViT-B/16 and rerun the experiments using the UDG and DG open-source codebases. We provide the default hyperparameters for the UDG baselines in Table 5, and the search distribution for each hyperparameter of each DG baseline is detailed in Table 6; a sketch of the sampling protocol follows the tables.

Table 5. Hyperparameters for baselines in UDG. BS represents the batch size, and WD denotes weight decay.
Table 6. Default hyperparameters and random search distribution for baselines in DG.
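As an illustration of this protocol, here is a minimal sketch of random hyperparameter search; the ranges below are hypothetical placeholders, not the actual distributions from Table 6.

```python
import random

# Illustrative only: each baseline's true search distributions are listed
# in Table 6; the ranges here merely demonstrate the sampling protocol.
search_space = {
    "lr":           lambda: 10 ** random.uniform(-5.0, -3.5),
    "weight_decay": lambda: 10 ** random.uniform(-6.0, -2.0),
    "dropout":      lambda: random.choice([0.0, 0.1, 0.5]),
}

def sample_config():
    return {name: draw() for name, draw in search_space.items()}

trials = [sample_config() for _ in range(20)]
# Each trial is trained and scored under the validation protocol, and the
# best-scoring configuration is kept for the reported run.
```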

1.2 C.2 Overall Performance

Unsupervised Domain Generalization. Due to limited space in the main paper, we report the remaining UDG results in Table 7. We employ Clipart, Infograph, and Quickdraw as training domains and Painting, Real, and Sketch as test domains. Following the same all-correlated settings and protocols as DARLING, we find that DisMAE achieves gains of 1.14%, 1.19%, 4.40%, and 5.45% in average accuracy over the second-best baselines under the 1%, 5%, 10%, and 100% label-fraction settings, respectively.

Table 7. Unsupervised domain generalization results on DomainNet. We employ Clipart, Infograph, and Quickdraw as training domains and Painting, Real, and Sketch as test domains. All models are unsupervised pre-trained before fine-tuning on the labeled data. Overall and Avg. denote the accuracy over all test data and the arithmetic mean of the individual domain accuracies, respectively; the two differ because the test domains are not equal in size. Bold = best, underline = second best.
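To make the caption's note concrete, here is a toy computation of the two metrics; all numbers are invented for illustration and are not results from the paper.

```python
# Toy numbers: per-domain correct predictions and test-set sizes for
# three unequal test domains.
correct = {"painting": 600, "real": 1800, "sketch": 500}
total   = {"painting": 1000, "real": 2000, "sketch": 1000}

# Overall weights each domain by its size; Avg. treats domains equally.
overall = sum(correct.values()) / sum(total.values())             # 0.725
average = sum(correct[d] / total[d] for d in total) / len(total)  # ~0.667
print(overall, average)
```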

Domain Generalization. Aligning with the training-domain validation setup in DomainBed, we achieve a 0.50% gain in average accuracy on the PACS dataset, as shown in Table 8.

Table 8. Domain generalization results on PACS. Bold = best, underline = second best.

1.3 C.3 Discussion of the Invariance Principle

In Figs. 2 and 4, we visualize the representations learned by MAE, our semantic encoder, and our variation encoder via t-SNE (a sketch of the protocol follows). We find that: (1) The representations generated by MAE for each domain overlap to a degree at the center of the plot, with slight variations within each distribution. This suggests that MAE captures both semantic and domain-variant information but fails to disentangle them effectively. (2) Our semantic representations are distributed uniformly across domains, which justifies that DisMAE learns domain-invariant representations from each domain. (3) The variation representations of each domain have their own distinct distributions, and clusters of similar variation data further emphasize domain-specific characteristics.
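For reproducibility, here is a minimal sketch of the visualization protocol, assuming scikit-learn's t-SNE and matplotlib; the feature arrays and domain labels are placeholders for the encoder outputs.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(features, domains, title):
    """features: (N, D) array of encoder outputs; domains: (N,) labels."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for d in np.unique(domains):
        pts = emb[domains == d]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=f"domain {d}")
    plt.title(title)
    plt.legend()
    plt.show()

# Interpretation: well-mixed domains for the semantic encoder indicate
# domain-invariant features; per-domain clusters for the variation
# encoder indicate domain-specific ones.
```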

1.4 C.4 More Ablation Study

Effects of Decoder Depth. The efficacy of the adaptive contrastive loss hinges on the output of the decoder. This prompts the question: how many decoder layers are optimal for peak performance? As shown in Table 9, a deeper decoder may overfit the reconstruction and subsequently diminish the effect of our contrastive loss. Adopting a lightweight decoder thus both accelerates training and guarantees robustness.

Table 9. Hyperparameter analysis of the decoder depth.

Effects of Mask Ratios. In Table 10, we vary the mask ratio to test the robustness of our model. We find that a mask ratio of 80% yields the optimal result and adopt it as our default protocol; a sketch of the masking procedure follows the table.

Table 10. Hyperparameter analysis of the mask ratio.
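For reference, here is a minimal sketch of MAE-style per-sample random masking with the default 80% ratio; this follows the standard MAE recipe and may differ from DisMAE's exact implementation.

```python
import torch

def random_masking(patches, mask_ratio=0.8):
    """Per-sample random patch masking in the style of MAE.
    patches: (N, L, D) token sequence; returns the visible tokens and a
    binary mask (0 = kept, 1 = masked)."""
    N, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L)                    # one random score per patch
    ids_keep = noise.argsort(dim=1)[:, :len_keep]
    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(N, L)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask

# e.g. with 196 patches (14 x 14) and ratio 0.8, only 39 tokens are encoded.
```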

1.5 C.5 Failure Cases

Some failure cases of our proposed DisMAE are shown in Fig. 6. Our approach struggles with reconstructions containing intricate details and fine lines; it frequently fails to generate images that possess sufficient detail while simultaneously providing clear augmented variations. We attribute these failures to two primary reasons: 1) the MAE backbone operates on patches, making pixel-level reconstruction difficult, and our method heavily relies on the MAE model's reconstruction outcomes; 2) our disentanglement lacks granularity, often capturing broad color regions and background information rather than nuanced details. In the context of UDG, reconstructing images with fine granularity, high resolution, and authenticity remains a challenging and crucial research direction. We are also keenly interested in exploring the potential integration of diffusion models within the UDG framework.

Fig. 6. Some failure cases of reconstructed images generated by DisMAE.

1.6 C.6 Qualitative Reconstructions

Additional visualization results of image reconstruction, spanning both Colored MNIST and DomainNet, are shown in Fig. 7.

DisMAE differentiates between the foreground and background of an image. Remarkably, it can discern domain styles and fuse domain-specific elements across images; a notable instance is superimposing the sun from a sketch onto a painting. This disentanglement ability endows DisMAE with the flexibility to generate controllable images by swapping semantic and variation factors, as sketched below.
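A minimal sketch of this swapping operation, reusing the hypothetical module names from the sketch in Appendix A (these are illustrative stand-ins, not the authors' exact API):

```python
# Combine the semantic code of image A with the variation code of image B
# to transfer B's domain style onto A's content (and vice versa).
s_a, v_a = semantic_enc(img_a), variation_enc(img_a)
s_b, v_b = semantic_enc(img_b), variation_enc(img_b)

a_restyled = decoder(s_a, v_b)   # A's content rendered in B's style
b_restyled = decoder(s_b, v_a)   # B's content rendered in A's style
```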

Fig. 7. Illustrative reconstructed images generated by DisMAE.

1.7 C.7 Detailed Implementation of DisMAE

We conduct all experiments in PyTorch on a cluster of 8 NVIDIA Tesla A100 GPUs with 40 GB of memory each. Our default backbone consists of a 12-block semantic encoder, a 6-block variation encoder, and a transformer-based decoder. We utilize ViT-B/16 as the default backbone for both the visualizations and the main experiments, and ViT-Tiny/16 in our ablation study. We set the margin \(\gamma = 0.008\) and \(\tau = 0.4\). In UDG, we choose the AdamW optimizer for the main branch, with a learning rate of 1e-4 and betas of (0.9, 0.95) for pre-training. For fine-tuning, we adopt learning rates of 0.025, 0.05, 5e-5, and 5e-5 and batch sizes of 96, 192, 36, and 36 in the 1%, 5%, 10%, and 100% label-fraction experiments, respectively, and fine-tune all checkpoints for 50 epochs. In DG, \(\lambda_1\) is selected from {5e-4, 1e-3, 5e-3, 1e-2} and \(\lambda_2\) from {0.1, 0.5, 1.0, 2.0}. The detailed hyperparameters for UDG and DG are listed in Table 11, and a sketch of the optimizer setup follows.

Table 11. Hyperparameter selection of DisMAE.
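A minimal sketch of the optimizer setup described above; the module handles are hypothetical placeholders for the actual DisMAE components.

```python
import torch

# Hypothetical module handles standing in for the DisMAE components.
main_params = (list(semantic_enc.parameters())
               + list(variation_enc.parameters())
               + list(decoder.parameters()))

# Pre-training optimizer for the main branch (UDG setting).
optimizer = torch.optim.AdamW(main_params, lr=1e-4, betas=(0.9, 0.95))

# Domain-classifier optimizer (see "Details About Training the Domain
# Classifier" below).
clf_optimizer = torch.optim.SGD(domain_clf.parameters(),
                                lr=5e-4, momentum=0.99, weight_decay=0.05)
```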

Details About Training the Domain Classifier. For the domain classifier, we use the SGD optimizer with a learning rate of 0.0005, momentum of 0.99, and weight decay of 0.05. We set the adaptive training interval \(T_{ad}\) to 15 and the maximum adaptive training epoch \(E_{ad}\) to 100 in the UDG setting. We update the domain classifier only by minimizing the cross-entropy loss while freezing the backbones, whenever \(e \bmod T_{ad} = 0\) and \(e \le E_{ad}\), where \(e\) is the current training epoch. The detailed algorithm can be found in Appendix A.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Zhang, A., Wang, H., Wang, X., Chua, TS. (2025). Disentangling Masked Autoencoders for Unsupervised Domain Generalization. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_8

  • DOI: https://doi.org/10.1007/978-3-031-72897-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72896-9

  • Online ISBN: 978-3-031-72897-6