Abstract
Domain Generalization (DG), designed to enhance out-of-distribution (OOD) generalization, centers on learning invariance against domain shifts with sufficient supervision signals. Yet the scarcity of such labeled data has given rise to unsupervised domain generalization (UDG), a more practical yet challenging task in which models are trained across diverse domains in an unsupervised manner and eventually tested on unseen domains. UDG is fast gaining attention but remains far from well-studied.
To close this research gap, we propose a novel learning framework designed for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), which aims to discover disentangled representations that faithfully reveal the intrinsic features and superficial variations without access to class labels. At its core is the distillation of domain-invariant semantic features, which cannot be distinguished by a domain classifier, while filtering out the domain-specific variations (for example, color schemes and texture patterns) that are unstable and redundant. Notably, DisMAE co-trains an asymmetric dual-branch architecture with semantic and lightweight variation encoders, offering dynamic data manipulation and representation-level augmentation capabilities. Extensive experiments on four benchmark datasets (i.e., DomainNet, PACS, VLCS, and Colored MNIST) with both DG and UDG tasks demonstrate that DisMAE achieves competitive OOD performance compared with state-of-the-art DG and UDG baselines, shedding light on a potential research line for improving generalization ability with large-scale unlabeled data. Our codes are available at https://github.com/rookiehb/DisMAE.
A. Zhang and H. Wang—These authors contributed equally to this work.
References
Ahuja, K., et al.: Invariance principle meets information bottleneck for out-of-distribution generalization. In: NeurIPS (2021)
Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. CoRR abs/1907.02893 (2019)
Bai, H., et al.: DecAug: out-of-distribution generalization via decomposed feature representation and semantic augmentation. In: AAAI (2021)
Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. In: ICLR (2022)
Bardes, A., Ponce, J., LeCun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. In: ICLR (2022)
Bui, M., Tran, T., Tran, A., Phung, D.Q.: Exploiting domain-specific features to enhance domain generalization. In: NeurIPS (2021)
Cabannes, V., Kiani, B.T., Balestriero, R., LeCun, Y., Bietti, A.: The SSL interplay: augmentations, inductive bias, and generalization. In: ICML (2023)
Cai, R., Li, Z., Wei, P., Qiao, J., Zhang, K., Hao, Z.: Learning disentangled semantic representation for domain adaptation. In: IJCAI (2019)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: NeurIPS (2020)
Chen, L., Zhang, Y., Song, Y., van den Hengel, A., Liu, L.: Domain generalization via rationale invariance. In: ICCV, pp. 1751–1760. IEEE (2023)
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I.: Generative pretraining from pixels. In: ICML (2020)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. In: ICML (2020)
Chen, X., et al.: Context autoencoder for self-supervised representation learning. CoRR abs/2202.03026 (2022)
Chen, X., Fan, H., Girshick, R.B., He, K.: Improved baselines with momentum contrastive learning. CoRR abs/2003.04297 (2020)
Chen, X., He, K.: Exploring simple siamese representation learning. In: CVPR (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)
Du, Y., et al.: AdaRNN: adaptive learning and forecasting of time series. In: CIKM (2021)
Fang, C., Xu, Y., Rockmore, D.N.: Unbiased metric learning: on the utilization of multiple datasets and web images for softening bias. In: ICCV, pp. 1657–1664. IEEE Computer Society (2013)
Gholami, B., El-Khamy, M., Song, K.: Latent feature disentanglement for visual domain generalization. IEEE Trans. Image Process. 32, 5751–5763 (2023)
Grill, J., et al.: Bootstrap your own latent - a new approach to self-supervised learning. In: NeurIPS (2020)
Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: ICLR. OpenReview.net (2021)
Harary, S., et al.: Unsupervised domain generalization by learning a bridge across domains. In: CVPR (2022)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.B.: Masked autoencoders are scalable vision learners. In: CVPR (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE Computer Society (2016)
Huang, J., Guan, D., Xiao, A., Lu, S.: FSDR: frequency space domain randomization for domain generalization. In: CVPR, pp. 6891–6902. Computer Vision Foundation/IEEE (2021)
Jung, Y., Tian, J., Bareinboim, E.: Learning causal effects via weighted empirical risk minimization. In: NeurIPS (2020)
Krueger, D., et al.: Out-of-distribution generalization via risk extrapolation (rex). In: ICML (2021)
Li, D., Yang, Y., Song, Y., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: ICCV, pp. 5543–5551. IEEE Computer Society (2017)
Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: CVPR (2018)
Li, P., Li, D., Li, W., Gong, S., Fu, Y., Hospedales, T.M.: A simple feature augmentation for domain generalization. In: ICCV (2021)
Li, Y., et al.: Deep domain generalization via conditional invariant adversarial networks. In: ECCV (15) (2018)
Lin, C., Yuan, Z., Zhao, S., Sun, P., Wang, C., Cai, J.: Domain-invariant disentangled network for generalizable object detection. In: ICCV, pp. 8751–8760. IEEE (2021)
Liu, Y., et al.: Promoting semantic connectivity: dual nearest neighbors contrastive learning for unsupervised domain generalization. In: CVPR (2023)
Lu, W., Wang, J., Li, H., Chen, Y., Xie, X.: Domain-invariant feature exploration for domain generalization. Trans. Mach. Learn. Res. 2022 (2022)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018)
Pearl, J., Glymour, M., Jewell, N.P.: Causal Inference in Statistics: A Primer. Wiley, Hoboken (2016)
Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: ICCV (2019)
Peng, X., Huang, Z., Sun, X., Saenko, K.: Domain agnostic learning with disentangled representations. In: ICML. Proceedings of Machine Learning Research, vol. 97, pp. 5102–5112. PMLR (2019)
Ramé, A., Dancette, C., Cord, M.: Fishr: invariant gradient variances for out-of-distribution generalization. In: ICML. Proceedings of Machine Learning Research, vol. 162 (2022)
Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. CoRR abs/1911.08731 (2019)
Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., Sarawagi, S.: Generalizing across domains via cross-gradient training. In: ICLR (Poster). OpenReview.net (2018)
Shao, R., Lan, X., Li, J., Yuen, P.C.: Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In: CVPR (2019)
Shu, Y., Cao, Z., Wang, C., Wang, J., Long, M.: Open domain generalization with domain-augmented meta-learning. In: CVPR, pp. 9624–9633. Computer Vision Foundation/IEEE (2021)
Vapnik, V.: An overview of statistical learning theory. IEEE Trans. Neural Networks 10(5), 988–999 (1999)
Wald, Y., Feder, A., Greenfeld, D., Shalit, U.: On calibration and out-of-domain generalization. In: NeurIPS (2021)
Wang, J., et al.: Generalizing to unseen domains: a survey on domain generalization. IEEE Trans. Knowl. Data Eng. 35(8), 8052–8072 (2023)
Wang, T., Sun, Q., Pranata, S., Karlekar, J., Zhang, H.: Equivariance and invariance inductive bias for learning from insufficient data. In: ECCV (2022)
Wang, T., Yue, Z., Huang, J., Sun, Q., Zhang, H.: Self-supervised learning disentangled group representation as feature. In: NeurIPS (2021)
Wang, Y., Li, H., Cheng, H., Wen, B., Chau, L., Kot, A.C.: Variational disentanglement for domain generalization. Trans. Mach. Learn. Res. 2022 (2022)
Wen, Z., Li, Y.: Toward understanding the feature learning process of self-supervised contrastive learning. In: ICML (2021)
Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. In: CVPR (2022)
Yan, S., Song, H., Li, N., Zou, L., Ren, L.: Improve unsupervised domain adaptation with mixup training. CoRR abs/2001.00677 (2020)
Yang, H., et al.: Cycle-consistent masked autoencoder for unsupervised domain generalization. In: ICLR. OpenReview.net (2023)
Yang, H., et al.: Domain invariant masked autoencoders for self-supervised learning from multi-domains. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13691, pp. 151–168. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19821-2_9
Ye, H., Xie, C., Cai, T., Li, R., Li, Z., Wang, L.: Towards a theoretical framework of out-of-distribution generalization. In: NeurIPS (2021)
Ye, N., et al.: OOD-bench: quantifying and understanding two dimensions of out-of-distribution generalization. In: CVPR (2022)
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: ICML (2021)
Zhang, H., Zhang, Y., Liu, W., Weller, A., Schölkopf, B., Xing, E.P.: Towards principled disentanglement for domain generalization. In: CVPR (2022)
Zhang, X., He, Y., Xu, R., Yu, H., Shen, Z., Cui, P.: NICO++: towards better benchmarking for domain generalization. CoRR abs/2204.08040 (2022)
Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., Liu, H.: Towards unsupervised domain generalization. In: CVPR, pp. 4900–4910. IEEE (2022)
Zhao, S., Gong, M., Liu, T., Fu, H., Tao, D.: Domain generalization via entropy regularization. In: NeurIPS (2020)
Acknowledgements
This research is supported by the National Natural Science Foundation of China (92270114) and the advanced computing resources provided by the Supercomputing Center of USTC.
Appendices
A Algorithm
Algorithm 1 depicts the detailed procedure of DisMAE.
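Since Algorithm 1 is not reproduced here, the sketch below illustrates, in PyTorch-style code, what one training step of such a dual-branch objective could look like. The module interface (`semantic_encoder`, `variation_encoder`, `decoder`, `domain_classifier`, `random_mask`, `patchify`), the token pooling, and the way the invariance term is realized are our assumptions for illustration, not the released implementation; the weights \(\lambda_1\) and \(\lambda_2\) echo the hyperparameters mentioned in Appendix C.7.

```python
import torch
import torch.nn.functional as F

def dismae_step(model, images, domain_ids, lambda1, lambda2):
    """One training step of a dual-branch masked-autoencoding objective (illustrative sketch)."""
    # MAE-style per-sample random patch masking (80% ratio by default, see Appendix C.4).
    visible_tokens, mask = model.random_mask(images, ratio=0.8)

    sem = model.semantic_encoder(visible_tokens)   # domain-invariant branch
    var = model.variation_encoder(visible_tokens)  # lightweight domain-specific branch

    # Reconstruction loss on the masked patches, as in a standard MAE.
    pred = model.decoder(sem, var)
    target = model.patchify(images)
    loss_rec = F.mse_loss(pred[mask], target[mask])

    # Invariance term: semantic features should be indistinguishable to the
    # periodically refreshed domain classifier (frozen in this step). One
    # plausible realization is to maximize its cross-entropy on pooled features.
    dom_logits = model.domain_classifier(sem.mean(dim=1))
    loss_inv = -F.cross_entropy(dom_logits, domain_ids)

    # Adaptive contrastive loss (see the illustrative sketch in Appendix B).
    loss_con = model.adaptive_contrastive_loss(sem, var, domain_ids)

    return loss_rec + lambda1 * loss_inv + lambda2 * loss_con
```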

B Discussion About Differences
We argue that DisMAE is novel and differs significantly from prior studies in three aspects. 1) Scope. Transitioning to UDG is non-trivial. Previous disentangled methods such as DADA [38], DIDN [32], and DIR [19], while effective in DG, struggle with unsupervised data due to their heavy dependence on class labels to encapsulate semantic attributes. 2) Disentangled Targets. Without class-label guidance, achieving a domain-invariant semantic encoder is challenging. Many UDG methods, such as DiMAE [54] and CycleMAE [53], can only separate domain styles using multiple decoders but fall short in disentangling domain-invariant semantics from variations. 3) Disentanglement Strategy. DisMAE is grounded in the disentanglement and invariance principles, uniquely combining an adaptive contrastive loss with the reconstruction loss in a collaborative manner. The adaptive contrastive loss, in particular, is designed by seamlessly leveraging the domain classifier and intra-domain negative sampling. The differences are summarized in Table 4.
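To make point 3) concrete, the following is a minimal sketch of what an InfoNCE-style term with intra-domain negative sampling and domain-classifier re-weighting could look like, using the temperature \(\tau = 0.4\) from Appendix C.7. The tensor interface and the confidence-based weighting scheme are illustrative assumptions, not the exact DisMAE loss.

```python
import torch
import torch.nn.functional as F

def adaptive_contrastive_loss(anchor, positive, candidates, domain_ids, domain_conf, tau=0.4):
    """InfoNCE-style loss restricted to intra-domain negatives and re-weighted by
    domain-classifier confidence (illustrative sketch, not the exact DisMAE loss).

    anchor, positive: (B, D) projected features of two views of the same image.
    candidates:       (B, D) in-batch features serving as negative candidates.
    domain_ids:       (B,)   integer domain label of each sample.
    domain_conf:      (B,)   classifier probability of the anchor's own domain;
                             confident samples (still carrying domain cues) are up-weighted.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    candidates = F.normalize(candidates, dim=-1)

    pos_logit = (anchor * positive).sum(-1) / tau               # (B,)
    logits = anchor @ candidates.t() / tau                      # (B, B)

    same_domain = domain_ids.unsqueeze(1) == domain_ids.unsqueeze(0)
    not_self = ~torch.eye(anchor.size(0), dtype=torch.bool, device=anchor.device)
    neg_mask = (same_domain & not_self).float()                 # intra-domain negatives only

    neg_sum = (logits.exp() * neg_mask).sum(-1)
    loss = -(pos_logit - torch.log(pos_logit.exp() + neg_sum))
    return (domain_conf.detach() * loss).mean()
```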
C Experiments
C.1 Experimental Settings
Baseline Hyperparameter Tuning. For a fair comparison, we uniformly substitute the backbones of all baselines with the same ViT-B/16 and rerun the experiments using the UDG and DG open-source codebases. The default hyperparameters for the UDG baselines are provided in Table 5, and the search distribution for each hyperparameter of each DG baseline is detailed in Table 6.
C.2 Overall Performance
Unsupervised Domain Generalization. Due to limited space in the main paper, we report the remaining UDG results in Table 7. We employ Clipart, Infograph, and Quickdraw as training domains and Painting, Real, and Sketch as test domains. Following the same all-correlated settings and protocols as DARLING, we find that DisMAE achieves 1.14%, 1.19%, 4.40%, and 5.45% gains in average accuracy over the second-best baselines under the 1%, 5%, 10%, and 100% label fraction settings, respectively.
Domain Generalization. Following the training-domain validation setup in DomainBed, we achieve a 0.50% gain in average accuracy on the PACS dataset, as shown in Table 8.
C.3 Discussion of the Invariance Principle
In Figs. 2 and 4, we visualize the representations learned by MAE, our semantic encoder, and our variation encoder via t-SNE. We find that: (1) The representations generated by MAE for each domain show a degree of overlap at the center of the plot, accompanied by slight variations within each distribution. This suggests that MAE captures both semantic and domain-variant information but fails to disentangle them effectively. (2) Our semantic representations are distributed uniformly across domains, justifying that DisMAE learns domain-invariant representations from each domain. (3) The variation representations of each domain form their own distinct distributions, and clusters of similar variation data further emphasize domain-specific characteristics.
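For reproducibility, such t-SNE plots can be generated with a short scikit-learn snippet along the following lines, assuming per-image feature matrices have already been extracted from the respective encoders.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, domain_ids, title):
    """Project per-image features (N, D) to 2-D and color them by domain."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    for d in np.unique(domain_ids):
        idx = domain_ids == d
        plt.scatter(emb[idx, 0], emb[idx, 1], s=4, label=f"domain {d}")
    plt.title(title)
    plt.legend()
    plt.show()
```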
C.4 More Ablation Studies
Effects of Decoder Depth. The efficacy of the adaptive contrastive loss hinges on the decoder outputs. This prompts the question: how many decoder layers are optimal for peak performance? As shown in Table 9, a deeper decoder may overfit the reconstruction and subsequently diminish the effect of our contrastive loss. Thus, adopting a lightweight decoder both accelerates training and guarantees robustness.
Effects of Mask Ratios. In Table 10, we vary the mask ratio to test the robustness of our model. We find that a mask ratio of 80% achieves the optimal result, and we adopt it as our default protocol.
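Assuming MAE-style per-sample random patch masking, the mask ratio enters as the single hyperparameter of a routine like the sketch below; the token layout and function name are illustrative assumptions.

```python
import torch

def random_masking(tokens, mask_ratio=0.8):
    """Per-sample random patch masking in the style of MAE (illustrative sketch).

    tokens: (B, N, D) patch embeddings. Returns the kept tokens and a boolean
    mask over the original N positions (True = masked / to be reconstructed).
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)            # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, ids_keep, False)             # kept positions are not masked
    return kept, mask
```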
C.5 Failure Cases
Some failure cases of our proposed DisMAE are shown in Fig. 6. Our approach struggles with reconstructions containing intricate details and fine lines; it frequently fails to generate images that possess sufficient detail while simultaneously providing clear augmented variations. We attribute these failures to two primary reasons: 1) The MAE backbone operates on patches, making pixel-level reconstruction difficult, and our method relies heavily on the MAE model's reconstruction outcomes. 2) Our disentanglement lacks granularity, often capturing broad color regions and background information rather than nuanced details. In the context of UDG, reconstructing images with fine granularity, high resolution, and authenticity remains a challenging and crucial research direction. We are also keenly interested in exploring the potential integration of diffusion models within the UDG framework.
C.6 Qualitative Reconstructions
Additional visualization results of image reconstruction, spanning both Colored MNIST and DomainNet, can be observed in Fig. 7.
DisMAE differentiates between the foreground and background of an image. Remarkably, DisMAE can discern domain styles and fuse domain-specific elements across them—a notable instance is superimposing the sun from a sketch onto a painting. Such disentanglement ability endows DisMAE with the flexibility to generate controllable images by manipulating semantic and variation factors through swapping.
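A minimal sketch of this swap-based controllable generation, assuming the same hypothetical encoder/decoder interface as in the earlier sketches, is shown below.

```python
import torch

@torch.no_grad()
def swap_generate(model, image_a, image_b):
    """Recombine the semantic content of image_a with the variation (style) of
    image_b, e.g. a sketch's sun rendered in a painting's style (sketch only)."""
    sem_a = model.semantic_encoder(image_a)   # content: shapes, object identity
    var_b = model.variation_encoder(image_b)  # style: color scheme, texture
    return model.unpatchify(model.decoder(sem_a, var_b))
```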
C.7 Detailed Implementation of DisMAE
We conduct all experiments in PyTorch on a cluster of 8 NVIDIA Tesla A100 GPUs with 40 GB memory each. Our default backbone consists of a 12-block semantic encoder, a 6-block variation encoder, and a transformer-based decoder. We utilize ViT-B/16 as the default backbone for both visualization and main experiments, and ViT-Tiny/16 in the ablation studies. We set the margin \(\gamma =0.008\) and the temperature \(\tau =0.4\). In UDG, we choose the AdamW optimizer for the main branch and set the learning rate to 1e-4 and betas to (0.9, 0.95) for pre-training. For finetuning, we adopt learning rates of 0.025, 0.05, 5e-5, and 5e-5 and batch sizes of 96, 192, 36, and 36 for the 1%, 5%, 10%, and 100% label fraction experiments, respectively, and finetune all checkpoints for 50 epochs. In DG, \(\lambda _1\) is selected within {5e-4, 1e-3, 5e-3, 1e-2} and \(\lambda _2\) within {0.1, 0.5, 1.0, 2.0}. The detailed hyperparameters for UDG and DG are listed in Table 11.
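For convenience, the UDG optimizer and fine-tuning settings above can be summarized in the following configuration sketch; the function and dictionary names are ours, not identifiers from the released code.

```python
import torch

def build_pretrain_optimizer(model):
    # Pre-training optimizer for the main branch (UDG setting).
    return torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

# Fine-tuning schedule per label fraction: learning rate and batch size, 50 epochs each.
FINETUNE_CONFIG = {
    "1%":   {"lr": 0.025, "batch_size": 96},
    "5%":   {"lr": 0.05,  "batch_size": 192},
    "10%":  {"lr": 5e-5,  "batch_size": 36},
    "100%": {"lr": 5e-5,  "batch_size": 36},
}
```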
Details About Training the Domain Classifier. For the domain classifier, we use the SGD optimizer with a learning rate of 0.0005, momentum of 0.99, and weight decay of 0.05. We choose the adaptive training interval \(T_{ad}\) as 15 and the maximum adaptive training epoch \(E_{ad}\) as 100 in the UDG setting. We update the domain classifier only by minimizing the cross-entropy loss while freezing the backbones, whenever \(e \bmod T_{ad} = 0\) and \(e \le E_{ad}\), where e is the current training epoch. The detailed algorithm can be found in Appendix A.
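Read as code, this update schedule amounts to the short loop below; the model interface and data loader are illustrative placeholders, while the optimizer settings and the \(T_{ad}\)/\(E_{ad}\) schedule follow the text above.

```python
import torch
import torch.nn.functional as F

def maybe_update_domain_classifier(model, loader, epoch, t_ad=15, e_ad=100):
    """Refresh the domain classifier every T_ad epochs during the first E_ad epochs,
    with the backbone frozen (sketch of the schedule described above)."""
    if epoch % t_ad != 0 or epoch > e_ad:
        return
    optimizer = torch.optim.SGD(model.domain_classifier.parameters(),
                                lr=5e-4, momentum=0.99, weight_decay=0.05)
    for images, domain_ids in loader:
        with torch.no_grad():                        # backbone stays frozen
            sem = model.semantic_encoder(images).mean(dim=1)
        logits = model.domain_classifier(sem)
        loss = F.cross_entropy(logits, domain_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```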