Abstract
Domain Generalization (DG), designed to enhance out-of-distribution (OOD) generalization, centers on learning invariance against domain shifts with sufficient supervision signals. Yet the scarcity of such labeled data has given rise to unsupervised domain generalization (UDG), a more practical yet challenging task in which models are trained across diverse domains in an unsupervised manner and eventually tested on unseen domains. UDG is fast gaining attention but remains far from well-studied.
To close this research gap, we propose a novel learning framework designed for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), which aims to discover disentangled representations that faithfully reveal the intrinsic features and superficial variations without access to class labels. At its core is the distillation of domain-invariant semantic features, which cannot be distinguished by a domain classifier, while filtering out the domain-specific variations (for example, color schemes and texture patterns) that are unstable and redundant. Notably, DisMAE co-trains an asymmetric dual-branch architecture with semantic and lightweight variation encoders, offering dynamic data manipulation and representation-level augmentation capabilities. Extensive experiments on four benchmark datasets (i.e., DomainNet, PACS, VLCS, and Colored MNIST) with both DG and UDG tasks demonstrate that DisMAE achieves competitive OOD performance compared with state-of-the-art DG and UDG baselines, shedding light on a potential research line for improving generalization ability with large-scale unlabeled data. Our codes are available at https://github.com/rookiehb/DisMAE.
A. Zhang and H. Wang—These authors contributed equally to this work.
References
Ahuja, K., et al.: Invariance principle meets information bottleneck for out-of-distribution generalization. In: NeurIPS (2021)
Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. CoRR abs/1907.02893 (2019)
Bai, H., et al.: DecAug: out-of-distribution generalization via decomposed feature representation and semantic augmentation. In: AAAI (2021)
Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. In: ICLR (2022)
Bardes, A., Ponce, J., LeCun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. In: ICLR (2022)
Bui, M., Tran, T., Tran, A., Phung, D.Q.: Exploiting domain-specific features to enhance domain generalization. In: NeurIPS (2021)
Cabannes, V., Kiani, B.T., Balestriero, R., LeCun, Y., Bietti, A.: The SSL interplay: augmentations, inductive bias, and generalization. In: ICML (2023)
Cai, R., Li, Z., Wei, P., Qiao, J., Zhang, K., Hao, Z.: Learning disentangled semantic representation for domain adaptation. In: IJCAI (2019)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: NeurIPS (2020)
Chen, L., Zhang, Y., Song, Y., van den Hengel, A., Liu, L.: Domain generalization via rationale invariance. In: ICCV, pp. 1751–1760. IEEE (2023)
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I.: Generative pretraining from pixels. In: ICML (2020)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
Chen, X., et al.: Context autoencoder for self-supervised representation learning. CoRR abs/2202.03026 (2022)
Chen, X., Fan, H., Girshick, R.B., He, K.: Improved baselines with momentum contrastive learning. CoRR abs/2003.04297 (2020)
Chen, X., He, K.: Exploring simple siamese representation learning. In: CVPR (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Du, Y., et al.: AdaRNN: adaptive learning and forecasting of time series. In: CIKM (2021)
Fang, C., Xu, Y., Rockmore, D.N.: Unbiased metric learning: on the utilization of multiple datasets and web images for softening bias. In: ICCV, pp. 1657–1664. IEEE Computer Society (2013)
Gholami, B., El-Khamy, M., Song, K.: Latent feature disentanglement for visual domain generalization. IEEE Trans. Image Process. 32, 5751–5763 (2023)
Grill, J., et al.: Bootstrap your own latent - a new approach to self-supervised learning. In: NeurIPS (2020)
Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: ICLR. OpenReview.net (2021)
Harary, S., et al.: Unsupervised domain generalization by learning a bridge across domains. In: CVPR (2022)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.B.: Masked autoencoders are scalable vision learners. In: CVPR (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE Computer Society (2016)
Huang, J., Guan, D., Xiao, A., Lu, S.: FSDR: frequency space domain randomization for domain generalization. In: CVPR, pp. 6891–6902. Computer Vision Foundation/IEEE (2021)
Jung, Y., Tian, J., Bareinboim, E.: Learning causal effects via weighted empirical risk minimization. In: NeurIPS (2020)
Krueger, D., et al.: Out-of-distribution generalization via risk extrapolation (rex). In: ICML (2021)
Li, D., Yang, Y., Song, Y., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: ICCV, pp. 5543–5551. IEEE Computer Society (2017)
Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: CVPR (2018)
Li, P., Li, D., Li, W., Gong, S., Fu, Y., Hospedales, T.M.: A simple feature augmentation for domain generalization. In: ICCV (2021)
Li, Y., et al.: Deep domain generalization via conditional invariant adversarial networks. In: ECCV (15) (2018)
Lin, C., Yuan, Z., Zhao, S., Sun, P., Wang, C., Cai, J.: Domain-invariant disentangled network for generalizable object detection. In: ICCV, pp. 8751–8760. IEEE (2021)
Liu, Y., et al.: Promoting semantic connectivity: dual nearest neighbors contrastive learning for unsupervised domain generalization. In: CVPR (2023)
Lu, W., Wang, J., Li, H., Chen, Y., Xie, X.: Domain-invariant feature exploration for domain generalization. Trans. Mach. Learn. Res. 2022 (2022)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018)
Pearl, J., Glymour, M., Jewell, N.P.: Causal Inference in Statistics: A Primer. Wiley, Hoboken (2016)
Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: ICCV (2019)
Peng, X., Huang, Z., Sun, X., Saenko, K.: Domain agnostic learning with disentangled representations. In: ICML. Proceedings of Machine Learning Research, vol. 97, pp. 5102–5112. PMLR (2019)
Ramé, A., Dancette, C., Cord, M.: Fishr: invariant gradient variances for out-of-distribution generalization. In: ICML. Proceedings of Machine Learning Research, vol. 162 (2022)
Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. CoRR abs/1911.08731 (2019)
Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., Sarawagi, S.: Generalizing across domains via cross-gradient training. In: ICLR (Poster). OpenReview.net (2018)
Shao, R., Lan, X., Li, J., Yuen, P.C.: Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In: CVPR (2019)
Shu, Y., Cao, Z., Wang, C., Wang, J., Long, M.: Open domain generalization with domain-augmented meta-learning. In: CVPR, pp. 9624–9633. Computer Vision Foundation/IEEE (2021)
Vapnik, V.: An overview of statistical learning theory. IEEE Trans. Neural Networks 10(5), 988–999 (1999)
Wald, Y., Feder, A., Greenfeld, D., Shalit, U.: On calibration and out-of-domain generalization. In: NeurIPS (2021)
Wang, J., et al.: Generalizing to unseen domains: a survey on domain generalization. IEEE Trans. Knowl. Data Eng. 35(8), 8052–8072 (2023)
Wang, T., Sun, Q., Pranata, S., Karlekar, J., Zhang, H.: Equivariance and invariance inductive bias for learning from insufficient data. In: ECCV (2022)
Wang, T., Yue, Z., Huang, J., Sun, Q., Zhang, H.: Self-supervised learning disentangled group representation as feature. In: NeurIPS (2021)
Wang, Y., Li, H., Cheng, H., Wen, B., Chau, L., Kot, A.C.: Variational disentanglement for domain generalization. Trans. Mach. Learn. Res. 2022 (2022)
Wen, Z., Li, Y.: Toward understanding the feature learning process of self-supervised contrastive learning. In: ICML (2021)
Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. In: CVPR (2022)
Yan, S., Song, H., Li, N., Zou, L., Ren, L.: Improve unsupervised domain adaptation with mixup training. CoRR abs/2001.00677 (2020)
Yang, H., et al.: Cycle-consistent masked autoencoder for unsupervised domain generalization. In: ICLR. OpenReview.net (2023)
Yang, H., et al.: Domain invariant masked autoencoders for self-supervised learning from multi-domains. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13691, pp. 151–168. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19821-2_9
Ye, H., Xie, C., Cai, T., Li, R., Li, Z., Wang, L.: Towards a theoretical framework of out-of-distribution generalization. In: NeurIPS (2021)
Ye, N., et al.: OOD-bench: quantifying and understanding two dimensions of out-of-distribution generalization. In: CVPR (2022)
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: ICML (2021)
Zhang, H., Zhang, Y., Liu, W., Weller, A., Schölkopf, B., Xing, E.P.: Towards principled disentanglement for domain generalization. In: CVPR (2022)
Zhang, X., He, Y., Xu, R., Yu, H., Shen, Z., Cui, P.: NICO++: towards better benchmarking for domain generalization. CoRR abs/2204.08040 (2022)
Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., Liu, H.: Towards unsupervised domain generalization. In: CVPR, pp. 4900–4910. IEEE (2022)
Zhao, S., Gong, M., Liu, T., Fu, H., Tao, D.: Domain generalization via entropy regularization. In: NeurIPS (2020)
Acknowledgements
This research is supported by the National Natural Science Foundation of China (92270114) and the advanced computing resources provided by the Supercomputing Center of USTC.
Appendices
A Algorithm
Algorithm 1 depicts the detailed procedure of DisMAE.
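Since Algorithm 1 is not reproduced here, the sketch below illustrates, in PyTorch-style code, what one training step of such a dual-branch objective could look like. The module interface (`semantic_encoder`, `variation_encoder`, `decoder`, `domain_classifier`, `random_mask`, `patchify`), the token pooling, and the way the invariance term is realized are our assumptions for illustration, not the released implementation; the weights \(\lambda_1\) and \(\lambda_2\) echo the hyperparameters mentioned in Appendix C.7.

```python
import torch
import torch.nn.functional as F

def dismae_step(model, images, domain_ids, lambda1, lambda2):
    """One training step of a dual-branch masked-autoencoding objective (illustrative sketch)."""
    # MAE-style per-sample random patch masking (80% ratio by default, see Appendix C.4).
    visible_tokens, mask = model.random_mask(images, ratio=0.8)

    sem = model.semantic_encoder(visible_tokens)   # domain-invariant branch
    var = model.variation_encoder(visible_tokens)  # lightweight domain-specific branch

    # Reconstruction loss on the masked patches, as in a standard MAE.
    pred = model.decoder(sem, var)
    target = model.patchify(images)
    loss_rec = F.mse_loss(pred[mask], target[mask])

    # Invariance term: semantic features should be indistinguishable to the
    # periodically refreshed domain classifier (frozen in this step). One
    # plausible realization is to maximize its cross-entropy on pooled features.
    dom_logits = model.domain_classifier(sem.mean(dim=1))
    loss_inv = -F.cross_entropy(dom_logits, domain_ids)

    # Adaptive contrastive loss (see the illustrative sketch in Appendix B).
    loss_con = model.adaptive_contrastive_loss(sem, var, domain_ids)

    return loss_rec + lambda1 * loss_inv + lambda2 * loss_con
```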

B Discussion About Differences
We argue that DisMAE is novel and differs significantly from prior studies in three aspects. 1) Scope. Transitioning to UDG is non-trivial. Previous disentangled methods such as DADA [38], DIDN [32], and DIR [19], while effective in DG, struggle with unsupervised data due to their heavy dependence on class labels to encapsulate semantic attributes. 2) Disentangled Targets. Without class-label guidance, achieving a domain-invariant semantic encoder is challenging. Many UDG methods, such as DiMAE [54] and CycleMAE [53], can only separate domain styles using multiple decoders but fall short in disentangling domain-invariant semantics from variations. 3) Disentanglement Strategy. DisMAE is grounded in the disentanglement and invariance principles, uniquely combining an adaptive contrastive loss with the reconstruction loss in a collaborative manner. The adaptive contrastive loss, in particular, is designed by seamlessly leveraging the domain classifier and intra-domain negative sampling. The differences are summarized in Table 4.
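To make point 3) concrete, the following is a minimal sketch of what an InfoNCE-style term with intra-domain negative sampling and domain-classifier re-weighting could look like, using the temperature \(\tau = 0.4\) from Appendix C.7. The tensor interface and the confidence-based weighting scheme are illustrative assumptions, not the exact DisMAE loss.

```python
import torch
import torch.nn.functional as F

def adaptive_contrastive_loss(anchor, positive, candidates, domain_ids, domain_conf, tau=0.4):
    """InfoNCE-style loss restricted to intra-domain negatives and re-weighted by
    domain-classifier confidence (illustrative sketch, not the exact DisMAE loss).

    anchor, positive: (B, D) projected features of two views of the same image.
    candidates:       (B, D) in-batch features serving as negative candidates.
    domain_ids:       (B,)   integer domain label of each sample.
    domain_conf:      (B,)   classifier probability of the anchor's own domain;
                             confident samples (still carrying domain cues) are up-weighted.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    candidates = F.normalize(candidates, dim=-1)

    pos_logit = (anchor * positive).sum(-1) / tau               # (B,)
    logits = anchor @ candidates.t() / tau                      # (B, B)

    same_domain = domain_ids.unsqueeze(1) == domain_ids.unsqueeze(0)
    not_self = ~torch.eye(anchor.size(0), dtype=torch.bool, device=anchor.device)
    neg_mask = (same_domain & not_self).float()                 # intra-domain negatives only

    neg_sum = (logits.exp() * neg_mask).sum(-1)
    loss = -(pos_logit - torch.log(pos_logit.exp() + neg_sum))
    return (domain_conf.detach() * loss).mean()
```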
C Experiments
C.1 Experimental Settings
Baseline Hyperparameter Tuning. For a fair comparison, we uniformly substitute the backbones of all baselines with the same ViT-B/16 and rerun the experiments using the UDG and DG open-source codebases. The default hyperparameters for the UDG baselines are provided in Table 5, and the search distribution for each hyperparameter of each DG baseline is detailed in Table 6.
C.2 Overall Performance
Unsupervised Domain Generalization. Due to limited space in the main paper, we report the remaining UDG results in Table 7. We employ Clipart, Infograph, and Quickdraw as training domains and Painting, Real, and Sketch as test domains. Following the same all-correlated settings and protocols as DARLING, we find that DisMAE achieves 1.14%, 1.19%, 4.40%, and 5.45% gains in average accuracy over the second-best baselines under the 1%, 5%, 10%, and 100% label fraction settings, respectively.
Domain Generalization. Following the training-domain validation setup in DomainBed, we achieve a 0.50% gain in average accuracy on the PACS dataset, as shown in Table 8.
C.3 Discussion of the Invariance Principle
In Figs. 2 and 4, we visualize the representations learned by MAE, our semantic encoder, and our variation encoder via t-SNE. We find that: (1) The representations generated by MAE for each domain show a degree of overlap at the center of the plot, accompanied by slight variations within each distribution. This suggests that MAE captures both semantic and domain-variant information but fails to disentangle them effectively. (2) Our semantic representations are distributed uniformly across domains, justifying that DisMAE learns domain-invariant representations from each domain. (3) The variation representations of each domain form their own distinct distributions, and clusters of similar variation data further emphasize domain-specific characteristics.
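For reproducibility, such t-SNE plots can be generated with a short scikit-learn snippet along the following lines, assuming per-image feature matrices have already been extracted from the respective encoders.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, domain_ids, title):
    """Project per-image features (N, D) to 2-D and color them by domain."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    for d in np.unique(domain_ids):
        idx = domain_ids == d
        plt.scatter(emb[idx, 0], emb[idx, 1], s=4, label=f"domain {d}")
    plt.title(title)
    plt.legend()
    plt.show()
```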
C.4 More Ablation Studies
Effects of Decoder Depth. The efficacy of the adaptive contrastive loss hinges on the decoder outputs. This prompts the question: how many decoder layers are optimal for peak performance? As shown in Table 9, a deeper decoder may overfit the reconstruction and subsequently diminish the effect of our contrastive loss. Thus, adopting a lightweight decoder both accelerates training and guarantees robustness.
Effects of Mask Ratios. In Table 10, we vary the mask ratio to test the robustness of our model. We find that a mask ratio of 80% achieves the optimal result, and we adopt it as our default protocol.
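Assuming MAE-style per-sample random patch masking, the mask ratio enters as the single hyperparameter of a routine like the sketch below; the token layout and function name are illustrative assumptions.

```python
import torch

def random_masking(tokens, mask_ratio=0.8):
    """Per-sample random patch masking in the style of MAE (illustrative sketch).

    tokens: (B, N, D) patch embeddings. Returns the kept tokens and a boolean
    mask over the original N positions (True = masked / to be reconstructed).
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)            # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, ids_keep, False)             # kept positions are not masked
    return kept, mask
```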
C.5 Failure Cases
Some failure cases of our proposed DisMAE are shown in Fig. 6. Our approach struggles with reconstructions containing intricate details and fine lines; it frequently fails to generate images that possess sufficient detail while simultaneously providing clear augmented variations. We attribute these failures to two primary reasons: 1) The MAE backbone operates on patches, making pixel-level reconstruction difficult, and our method relies heavily on the MAE model's reconstruction outcomes. 2) Our disentanglement lacks granularity, often capturing broad color regions and background information rather than nuanced details. In the context of UDG, reconstructing images with fine granularity, high resolution, and authenticity remains a challenging and crucial research direction. We are also keenly interested in exploring the potential integration of diffusion models within the UDG framework.
C.6 Qualitative Reconstructions
Additional visualization results of image reconstruction, spanning both Colored MNIST and DomainNet, can be observed in Fig. 7.
DisMAE differentiates between the foreground and background of an image. Remarkably, DisMAE can discern domain styles and fuse domain-specific elements across them—a notable instance is superimposing the sun from a sketch onto a painting. Such disentanglement ability endows DisMAE with the flexibility to generate controllable images by manipulating semantic and variation factors through swapping.
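A minimal sketch of this swap-based controllable generation, assuming the same hypothetical encoder/decoder interface as in the earlier sketches, is shown below.

```python
import torch

@torch.no_grad()
def swap_generate(model, image_a, image_b):
    """Recombine the semantic content of image_a with the variation (style) of
    image_b, e.g. a sketch's sun rendered in a painting's style (sketch only)."""
    sem_a = model.semantic_encoder(image_a)   # content: shapes, object identity
    var_b = model.variation_encoder(image_b)  # style: color scheme, texture
    return model.unpatchify(model.decoder(sem_a, var_b))
```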
C.7 Detailed Implementation of DisMAE
We conduct all experiments in PyTorch on a cluster of 8 NVIDIA Tesla A100 GPUs with 40 GB memory each. Our default backbone consists of a 12-block semantic encoder, a 6-block variation encoder, and a transformer-based decoder. We utilize ViT-B/16 as the default backbone for both visualization and main experiments, and ViT-Tiny/16 in the ablation studies. We set the margin \(\gamma =0.008\) and the temperature \(\tau =0.4\). In UDG, we choose the AdamW optimizer for the main branch and set the learning rate to 1e-4 and betas to (0.9, 0.95) for pre-training. For finetuning, we adopt learning rates of 0.025, 0.05, 5e-5, and 5e-5 and batch sizes of 96, 192, 36, and 36 for the 1%, 5%, 10%, and 100% label fraction experiments, respectively, and finetune all checkpoints for 50 epochs. In DG, \(\lambda _1\) is selected within {5e-4, 1e-3, 5e-3, 1e-2} and \(\lambda _2\) within {0.1, 0.5, 1.0, 2.0}. The detailed hyperparameters for UDG and DG are listed in Table 11.
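For convenience, the UDG optimizer and fine-tuning settings above can be summarized in the following configuration sketch; the function and dictionary names are ours, not identifiers from the released code.

```python
import torch

def build_pretrain_optimizer(model):
    # Pre-training optimizer for the main branch (UDG setting).
    return torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

# Fine-tuning schedule per label fraction: learning rate and batch size, 50 epochs each.
FINETUNE_CONFIG = {
    "1%":   {"lr": 0.025, "batch_size": 96},
    "5%":   {"lr": 0.05,  "batch_size": 192},
    "10%":  {"lr": 5e-5,  "batch_size": 36},
    "100%": {"lr": 5e-5,  "batch_size": 36},
}
```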
Details About Training the Domain Classifier. For the domain classifier, we use the SGD optimizer with a learning rate of 0.0005, momentum of 0.99, and weight decay of 0.05. We choose the adaptive training interval \(T_{ad}\) as 15 and the maximum adaptive training epoch \(E_{ad}\) as 100 in the UDG setting. We update the domain classifier only by minimizing the cross-entropy loss while freezing the backbones, whenever \(e \bmod T_{ad} = 0\) and \(e \le E_{ad}\), where e is the current training epoch. The detailed algorithm can be found in Appendix A.
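Read as code, this update schedule amounts to the short loop below; the model interface and data loader are illustrative placeholders, while the optimizer settings and the \(T_{ad}\)/\(E_{ad}\) schedule follow the text above.

```python
import torch
import torch.nn.functional as F

def maybe_update_domain_classifier(model, loader, epoch, t_ad=15, e_ad=100):
    """Refresh the domain classifier every T_ad epochs during the first E_ad epochs,
    with the backbone frozen (sketch of the schedule described above)."""
    if epoch % t_ad != 0 or epoch > e_ad:
        return
    optimizer = torch.optim.SGD(model.domain_classifier.parameters(),
                                lr=5e-4, momentum=0.99, weight_decay=0.05)
    for images, domain_ids in loader:
        with torch.no_grad():                        # backbone stays frozen
            sem = model.semantic_encoder(images).mean(dim=1)
        logits = model.domain_classifier(sem)
        loss = F.cross_entropy(logits, domain_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```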