Domain Invariant Masked Autoencoders for Self-supervised Learning from Multi-domains

Yang, Haiyang; Tang, Shixiang; Chen, Meilin; Wang, Yizhou; Zhu, Feng; Bai, Lei; Zhao, Rui; Ouyang, Wanli

doi:10.1007/978-3-031-19821-2_9

Haiyang Yang^12,15,
Shixiang Tang^14,15,
Meilin Chen^13,15,
Yizhou Wang^13,15,
Feng Zhu¹⁵,
Lei Bai¹⁴,
Rui Zhao^15,16 &
…
Wanli Ouyang¹⁴

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13691))

Included in the following conference series:

European Conference on Computer Vision

3498 Accesses
8 Citations

Abstract

Generalizing learned representations across significantly different visual domains is a fundamental yet crucial ability of the human visual system. While recent self-supervised learning methods have achieved good performances with evaluation set on the same domain as the training set, they will have an undesirable performance decrease when tested on a different domain. Therefore, the self-supervised learning from multiple domains task is proposed to learn domain-invariant features that are not only suitable for evaluation on the same domain as the training set, but also can be generalized to unseen domains. In this paper, we propose a Domain-invariant Masked AutoEncoder (DiMAE) for self-supervised learning from multi-domains, which designs a new pretext task, i.e., the cross-domain reconstruction task, to learn domain-invariant features. The core idea is to augment the input image with style noise from different domains and then reconstruct the image from the embedding of the augmented image, regularizing the encoder to learn domain-invariant features. To accomplish the idea, DiMAE contains two critical designs, 1) content-preserved style mix, which adds style information from other domains to input while persevering the content in a parameter-free manner, and 2) multiple domain-specific decoders, which recovers the corresponding domain style of input to the encoded domain-invariant features for reconstruction. Experiments on PACS and DomainNet illustrate that DiMAE achieves considerable gains compared with recent state-of-the-art methods.

H. Yang—The work was done during an internship at SenseTime.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Representation Learning for Style and Content Disentanglement with Autoencoders

Multimodal and Multiclass Semi-supervised Image-to-Image Translation

Universal Model Adaptation by Style Augmented Open-set Consistency

Article 30 June 2023

Notes

1.
“content” and “style” are terms widely used in style mix. “content” means domain-invariant information, while “style” means domain-specific information.
2.
We do not use the widely-used ResNet18 [20] as the backbone, because DiMAE is exactly a generative method, in which Convolutaional networks are not applicable. We choose the ViT-small model for comparison because the number of their model parameters is similar.
3.
Overall and Avg. indicate the overall accuracy of all the test data and the arithmetic mean of the accuracy of 3 domains, respectively. Note that they are different because the capacities of different domains are not equal.

References

Albuquerque, I., Monteiro, J., Darvishi, M., Falk, T.H., Mitliagkas, I.: Generalizing to unseen domains via distribution matching. arXiv preprint arXiv:1911.00804 (2019)
Bao, H., Dong, L., Wei, F.: Beit: bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
Cao, S., Xu, P., Clifton, D.A.: How to understand masked autoencoders. arXiv preprint arXiv:2202.03670 (2022)
Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2229–2238 (2019)
Google Scholar
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Google Scholar
Chen, M., et al.: Generative pretraining from pixels. In: International Conference on Machine Learning PMLR, pp. 1691–1703 (2020)
Google Scholar
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning PMLR, pp. 1597–1607 (2020)
Google Scholar
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640–9649 (2021)
Google Scholar
Dai, Y., Li, X., Liu, J., Tong, Z., Duan, L.Y.: Generalizable person re-identification with relevance-aware mixture of experts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16145–16154 (2021)
Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16 $\times $ 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Ericsson, L., Gouk, H., Hospedales, T.M.: How well do self-supervised models transfer? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5414–5423 (2021)
Google Scholar
Feng, Z., Xu, C., Tao, D.: Self-supervised representation learning from multi-domain data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3245–3255 (2019)
Google Scholar
Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
Grill, J.B., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Advances Neural Inf. Process. Syst. 33, 21271–21284 (2020)
Google Scholar
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Hong, M., Choi, J., Kim, G.: Stylemix: separating content and style for enhanced data augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14862–14870 (2021)
Google Scholar
Hu, Q., Wang, X., Hu, W., Qi, G.J.: Adco: adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1074–1083 (2021)
Google Scholar
Kim, D., Saito, K., Oh, T.H., Plummer, B.A., Sclaroff, S., Saenko, K.: CDS: cross-domain self-supervised pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9123–9132 (2021)
Google Scholar
Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory: Learning representations for automatic colorization. In: Leibe, Bastian, Matas, Jiri, Sebe, Nicu, Welling, Max (eds.) ECCV 2016. LNCS, vol. 9908, pp. 577–593. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_35
Chapter Google Scholar
Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883 (2017)
Google Scholar
Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5542–5550 (2017)
Google Scholar
Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y.Z., Hospedales, T.M.: Episodic training for domain generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1446–1455 (2019)
Google Scholar
Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409 (2018)
Google Scholar
Li, X., Dai, Y., Ge, Y., Liu, J., Shan, Y., Duan, L.: Uncertainty modeling for out-of-distribution generalization. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=6HN7LHyzGgC
Li, Y., et al.: Deep domain generalization via conditional invariant adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639 (2018)
Google Scholar
Noroozi, Mehdi, Favaro, Paolo: Unsupervised learning of visual representations by solving Jigsaw puzzles. In: Leibe, Bastian, Matas, Jiri, Sebe, Nicu, Welling, Max (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5
Chapter Google Scholar
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544 (2016)
Google Scholar
Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1406–1415 (2019)
Google Scholar
Rahman, M.M., Fookes, C., Baktashmotlagh, M., Sridharan, S.: Correlation-aware adversarial domain adaptation and generalization. Pattern Recogn. 100, 107124 (2020)
Google Scholar
Sariyildiz, M.B., Kalantidis, Y., Larlus, D., Alahari, K.: Concept generalization in visual representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9629–9639 (2021)
Google Scholar
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning? arXiv preprint arXiv:2005.10243 (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning PMLR, pp. 10347–10357 (2021)
Google Scholar
Wang, Y., et al.: Revisiting the transferability of supervised pretraining: an MLP perspective. arXiv preprint arXiv:2112.00496 (2021)
Wang, Z., Loog, M., van Gemert, J.: Respecting domain relations: Hypothesis invariance for domain generalization. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 9756–9763. IEEE (2021)
Google Scholar
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742 (2018)
Google Scholar
Xu, Q., Zhang, R., Zhang, Y., Wang, Y., Tian, Q.: A fourier-based framework for domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14383–14392 (2021)
Google Scholar
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)
Google Scholar
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning PMLR, pp. 12310–12320 (2021)
Google Scholar
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018)
Google Scholar
Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., Liu, H.: Domain-irrelevant representation learning for unsupervised domain generalization. arXiv preprint arXiv:2107.06219 (2021)
Zhao, N., Wu, Z., Lau, R.W., Lin, S.: What makes instance discrimination good for transfer learning? arXiv preprint arXiv:2006.06606 (2020)
Zhou, K., Yang, Y., Hospedales, T., Xiang, T.: Deep domain-adversarial image generation for domain generalisation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13025–13032 (2020)
Google Scholar
Zhu, J., et al.: Complementary relation contrastive distillation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 203–212 (2021)
Google Scholar
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232 (2017)
Google Scholar
Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)
Google Scholar

Download references

Author information

Authors and Affiliations

Nanjing University, Nanjing, China
Haiyang Yang
Zhejiang University, Zhejiang, China
Meilin Chen & Yizhou Wang
The University of Sydney, Sydney, Australia
Shixiang Tang, Lei Bai & Wanli Ouyang
SenseTime Research, Hong Kong, China
Haiyang Yang, Shixiang Tang, Meilin Chen, Yizhou Wang, Feng Zhu & Rui Zhao
Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, China
Rui Zhao

Authors

Haiyang Yang
View author publications
You can also search for this author in PubMed Google Scholar
Shixiang Tang
View author publications
You can also search for this author in PubMed Google Scholar
Meilin Chen
View author publications
You can also search for this author in PubMed Google Scholar
Yizhou Wang
View author publications
You can also search for this author in PubMed Google Scholar
Feng Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Lei Bai
View author publications
You can also search for this author in PubMed Google Scholar
Rui Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Wanli Ouyang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lei Bai .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1715 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Yang, H. et al. (2022). Domain Invariant Masked Autoencoders for Self-supervised Learning from Multi-domains. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13691. Springer, Cham. https://doi.org/10.1007/978-3-031-19821-2_9

Download citation

DOI: https://doi.org/10.1007/978-3-031-19821-2_9
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19820-5
Online ISBN: 978-3-031-19821-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics