Pattern Recognition

Volume 134, February 2023, 109086

Domain Generalization by Joint-Product Distribution Alignment

https://doi.org/10.1016/j.patcog.2022.109086

Highlights

  • To address the domain difference in domain generalization, we propose to align multiple domains P1(x,y), …, Pn(x,y) via the alignment of two distributions: the joint distribution P(x,y,l) and the product distribution P(x,y)P(l), where the domain label l ∈ {1, …, n}.

  • We analytically derive an explicit estimate of the Relative Chi-Square (RCS) divergence between P(x,y,l) and P(x,y)P(l), and minimize this estimate to align distributions in the neural transformation space.

  • We demonstrate the effectiveness of our solution by conducting comprehensive experiments on several multi-domain image classification datasets.

Abstract

In this work, we address the problem of domain generalization for classification, where the goal is to learn a classification model on a set of source domains and generalize it to a target domain. The source and target domains are different, which weakens the generalization ability of the learned model. To tackle the domain difference, we propose to align a joint distribution and a product distribution using a neural transformation, and minimize the Relative Chi-Square (RCS) divergence between the two distributions to learn that transformation. In this manner, we conveniently achieve the alignment of multiple domains in the neural transformation space. Specifically, we show that the RCS divergence can be explicitly estimated as the maximal value of a quadratic function, which allows us to perform joint-product distribution alignment by minimizing the divergence estimate. We demonstrate the effectiveness of our solution through comparison with the state-of-the-art methods on several image classification datasets.

Introduction

A common assumption underlying most supervised learning algorithms is that the training (source) and test (target) data are drawn from the same domain P(x,y), where x is the feature variable and y is the class label. Under this assumption, a classification model appropriately trained in the source domain is guaranteed, in a probabilistic sense, to generalize well to the target domain [1]. Unfortunately, in real-world applications, control over the data generation process is often imperfect: the source data available for training the classification model can be distributionally different from the target data on which the model will be tested, a problem known as dataset shift [2], [3], dataset bias [4], or domain shift [5], [6]. Under such circumstances, the source-trained model may perform poorly on the target data [7], [8], [9], [10].

Domain generalization is concerned with the above non-identically-distributed supervised learning problem, where the training data are respectively drawn from n (n ≥ 2) source domains P1(x,y), …, Pn(x,y), while the test data are sampled from a target domain Pt(x,y). The source and target domains are different but related to one another [11], [12], [13], and the goal of domain generalization for classification is to learn/train a classification model on the source domains and generalize it to the target domain.

Domain generalization methods aim to exploit the domain relationship (i.e., the relationship among distributions) to reduce the domain difference and train a classification model on the source data [11], [13], [14], [15]. Essentially, these methods aim to learn a feature transformation (i.e., a projection matrix or a neural transformation) to align the n source domains P1(x,y), …, Pn(x,y) whose samples are available during training, and expect the learned transformation to generalize to the target domain Pt(x,y) such that its difference from the source ones is reduced. As a result, the source-trained model can generalize better to the target domain [11], [14], [16].

Since a domain P(x,y) can be factorized into P(x,y) = P(y|x)P(x) or P(x,y) = P(x|y)P(y), there are generally two solutions for aligning the n domains P1(x,y), …, Pn(x,y). The first one learns a feature transformation to align the set of n marginal distributions (marginals) P1(x), …, Pn(x), and assumes that the posterior distribution P(y|x) is stable [7], [11], [12]. However, as discussed in Zhao et al. [14], Nguyen et al. [15], Li et al. [16], the stability of P(y|x) is often violated in practice (e.g., in speaker recognition or object recognition), which could result in the under-alignment of domains. Therefore, the second solution proposes to align the domains P1(x,y), …, Pn(x,y) by seeking a feature transformation (e.g., a neural transformation) to align a set of n marginals and multiple sets of class-conditional distributions (class-conditionals) [13], [14], [16]. These sets of class-conditionals could either be c sets of n distributions P1(x|y=i), …, Pn(x|y=i) for i ∈ {1, …, c}, or n sets of c distributions Pl(x|y=1), …, Pl(x|y=c) for l ∈ {1, …, n}, where c is the number of classes. However, since this solution has to align multiple sets of class-conditionals, with each set containing multiple distributions, it may not scale well with the number of classes [15]. Besides, to align distributions in the neural network context, it usually needs to introduce additional discriminator subnetworks and to solve the challenging minimax problem between the neural transformation and the added subnetworks.
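To make the scaling concern concrete, the following tally (our own summary of the counts just described) spells out how many distributions the factorization-based solutions must align:

\[
P(x,y) = P(y \mid x)\,P(x) = P(x \mid y)\,P(y),
\qquad
\underbrace{P_1(x),\ldots,P_n(x)}_{n \text{ marginals}},
\qquad
\underbrace{\{P_1(x \mid y=i),\ldots,P_n(x \mid y=i)\}_{i=1}^{c}}_{c \times n \text{ class-conditionals}},
\]

so the second solution aligns on the order of n + cn distributions, a number that grows linearly with the number of classes c, whereas the solution introduced next always aligns exactly two.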

In this work, we address the above issues and propose a Joint-Product Distribution Alignment (JPDA) solution to align the n domains P1(x,y), …, Pn(x,y). To be specific, we first introduce a domain label l ∈ {1, …, n} for the n domains and rewrite them as P(x,y|l=1), …, P(x,y|l=n), respectively. We then learn a neural transformation (feature extractor) T to align the joint distribution P(x,y,l) and the product distribution P(x,y)P(l) such that P(T(x),y,l) = P(T(x),y)P(l), which implies that the distribution of (T(x),y) is independent of the domain label l. This independence conveniently leads to the alignment of the n domains, i.e., P(T(x),y|l=1) = … = P(T(x),y|l=n). Compared to the aforementioned two solutions from prior works [7], [11], [13], [14], our JPDA solution (1) avoids the factorization of domains and the alignment of the many factorized components, i.e., the marginal distributions and the class-conditional distributions, and (2) only needs to align two distributions, i.e., the joint distribution and the product distribution. Such alignment is algorithmically straightforward and scales well with the number of classes. Namely, unlike previous works [13], [14], in our solution the number of distributions aligned is fixed and does not grow with the number of classes. Apart from aligning distributions in the neural transformation space, we learn a downstream classifier for the target classification task.
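The step from joint-product equality to multi-domain alignment is elementary probability; written out:

\[
P(T(x),y,l) = P(T(x),y)\,P(l)
\;\Longrightarrow\;
P(T(x),y \mid l) = \frac{P(T(x),y,l)}{P(l)} = P(T(x),y)
\quad \text{for every } l \in \{1,\ldots,n\},
\]

and therefore \(P(T(x),y \mid l=1) = \cdots = P(T(x),y \mid l=n)\), i.e., the n transformed domains coincide.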

To be more specific, we align the joint distribution P(x,y,l) and the product distribution P(x,y)P(l) under the distribution-ratio-based Relative Chi-Square (RCS) divergence [18]. Importantly, we show that the RCS divergence between these two distributions can be analytically estimated as the maximal value of a quadratic function, and consequently obtain an explicit estimate of the RCS divergence. This allows us to directly minimize the divergence estimate with respect to the neural transformation to achieve joint-product distribution alignment. Compared to the existing adversarial methods [7], [13], [14] that make use of another distribution-ratio-based divergence, the Jensen–Shannon (JS) divergence, our JPDA solution (1) does not need to introduce additional discriminator subnetworks, which would require learning more network parameters, and (2) avoids solving the challenging minimax problem between the neural transformation and the discriminator subnetworks. Our cost function is a combination of the joint-product distribution divergence and the classification loss. We minimize it via the minibatch Stochastic Gradient Descent (SGD) algorithm, and obtain a network model (containing the neural transformation and the classifier) with better generalization capability. See Fig. 1, which illustrates our solution to domain generalization for image classification. To summarize, our major contributions in this work are as follows:
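For orientation, the RCS divergence of [18] is the relative Pearson (chi-square) divergence: with a mixing parameter \(\alpha \in [0,1)\) and mixture \(Q_\alpha = \alpha P + (1-\alpha)Q\), it compares P (here the joint distribution P(x,y,l)) to Q (here the product P(x,y)P(l)) through the \(\alpha\)-relative density ratio. A standard variational identity from the relative density-ratio estimation literature, which we take to underlie the quadratic-function estimate mentioned above (the exact derivation appears in the Methodology section), is

\[
D_{\mathrm{RCS}}(P \,\|\, Q)
= \tfrac{1}{2}\,\mathbb{E}_{Q_\alpha}\!\left[\left(\tfrac{\mathrm{d}P}{\mathrm{d}Q_\alpha} - 1\right)^{\!2}\right]
= \max_{f}\left\{\mathbb{E}_{P}[f] - \tfrac{1}{2}\,\mathbb{E}_{Q_\alpha}\!\left[f^{2}\right]\right\} - \tfrac{1}{2},
\]

with the maximum attained at \(f = \mathrm{d}P/\mathrm{d}Q_\alpha\). Restricting f to a linear-in-parameters model \(f_\theta = \theta^{\top}\phi\) turns the bracketed term into a concave quadratic in \(\theta\), whose maximal value is available in closed form and can then be minimized with respect to the neural transformation.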

  • We propose to align domains P1(x,y), …, Pn(x,y) via the alignment of joint distribution P(x,y,l) and product distribution P(x,y)P(l), where the domain label l ∈ {1, …, n}.

  • We analytically derive an explicit estimate of the RCS divergence between P(x,y,l) and P(x,y)P(l) to serve as the alignment loss.

  • We demonstrate the effectiveness of our solution by conducting comprehensive experiments on several multi-domain image classification datasets.
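As an illustration of the training procedure described above (classification loss plus the joint-product alignment term, minimized by minibatch SGD), below is a minimal PyTorch-style sketch. The function and variable names are ours, the linear basis in rcs_estimate is one generic relative density-ratio construction rather than necessarily the paper's exact estimator, and the product-distribution samples are approximated by randomly permuting the domain labels within a minibatch.

    import torch
    import torch.nn.functional as F

    def rcs_estimate(joint_z, prod_z, alpha=0.5, lam=1e-3):
        # Quadratic-form divergence estimate between samples of the joint
        # distribution (joint_z) and of the product distribution (prod_z),
        # using the samples themselves plus a bias as a linear basis.
        phi_j = torch.cat([joint_z, torch.ones_like(joint_z[:, :1])], dim=1)
        phi_p = torch.cat([prod_z, torch.ones_like(prod_z[:, :1])], dim=1)
        h = phi_j.mean(dim=0)                                   # E_P[phi]
        H = (alpha * phi_j.t() @ phi_j / phi_j.size(0)
             + (1.0 - alpha) * phi_p.t() @ phi_p / phi_p.size(0))
        H = H + lam * torch.eye(H.size(0), device=H.device)     # E_{Q_alpha}[phi phi^T] + ridge
        theta = torch.linalg.solve(H, h)                        # maximizer of the quadratic
        return 0.5 * h @ theta - 0.5                            # maximal value = divergence estimate

    def train_step(T, C, optimizer, x, y, l, num_classes, num_domains, trade_off=1.0):
        # One minibatch SGD step on: classification loss + alignment loss.
        feats = T(x)                                            # neural transformation T(x)
        cls_loss = F.cross_entropy(C(feats), y)

        y_onehot = F.one_hot(y, num_classes).float()
        l_onehot = F.one_hot(l, num_domains).float()
        joint_z = torch.cat([feats, y_onehot, l_onehot], dim=1)        # samples of P(T(x), y, l)
        perm = torch.randperm(l.size(0), device=l.device)
        prod_z = torch.cat([feats, y_onehot, l_onehot[perm]], dim=1)   # samples of P(T(x), y) P(l)

        loss = cls_loss + trade_off * rcs_estimate(joint_z, prod_z)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Minimizing this combined cost with respect to T (while jointly training the classifier C) yields the source-trained model that is then applied unchanged to the target domain.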


Related work

We first discuss the domain alignment works in domain generalization, which align domains by factorizing them and aligning the factorized components, i.e., the marginal distributions (marginals) and the class-conditional distributions (class-conditionals). Subsequently, we briefly review works that tackle the problem via other strategies.

The study of learning and generalizing a classification model from multiple source domains to a target domain can be traced back to the early works of

Methodology

We define the domain generalization problem, give an overview of our solution, and elaborate on the technical details.

Experiments

For conducting the domain generalization experiments, we note that there exist two different experimental settings in the field: one commonly practiced in Nguyen et al. [15], Xu et al. [27], Yang et al. [28], Carlucci et al. [37], and the other one proposed by Gulrajani and Lopez-Paz [38]. We conduct our experiments under the former, which involves following the settings in prior works and citing the available results reported by the authors themselves.

Conclusion

In this work, we study the domain generalization problem and propose the JPDA solution to better generalize a source-trained network classification model to a different but related target domain. Our solution aligns the joint distribution and the product distribution in the neural transformation space, and minimizes the classification loss. Particularly, the two distributions are aligned under the RCS divergence, which is estimated from empirical data via analytically solving an unconstrained

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108404, in part by the National Natural Science Foundation of China under Grant 62106137, and in part by the Shantou University under Grant NTF21035.


References (55)

  • H. Li et al., Domain generalization with adversarial feature learning, IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • S. Chen et al., Domain adaptation by joint distribution invariant projections, IEEE Trans. Image Process., 2020.
  • K. Zhou et al., Domain generalization with MixStyle, International Conference on Learning Representations, 2021.
  • K. Muandet et al., Domain generalization via invariant feature representation, International Conference on Machine Learning, 2013.
  • M. Ghifary et al., Scatter component analysis: a unified framework for domain adaptation and domain generalization, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
  • Y. Li et al., Deep domain generalization via conditional invariant adversarial networks, European Conference on Computer Vision, 2018.
  • S. Zhao et al., Domain generalization via entropy regularization, Advances in Neural Information Processing Systems, 2020.
  • A.T. Nguyen et al., Domain invariant representation learning with domain density transformations, Advances in Neural Information Processing Systems, 2021.
  • Y. Li et al., Domain generalization via conditional invariant representations, AAAI Conference on Artificial Intelligence, 2018.
  • H. Venkateswara et al., Deep hashing network for unsupervised domain adaptation, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • M. Yamada et al., Relative density-ratio estimation for robust distribution comparison, Neural Comput., 2013.
  • G. Blanchard et al., Generalizing from several related classification tasks to a new unlabeled sample, Advances in Neural Information Processing Systems, 2011.
  • A. Khosla et al., Undoing the damage of dataset bias, European Conference on Computer Vision, 2012.
  • K. Akuzawa et al., Adversarial invariant feature learning with accuracy constraint for domain generalization, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019.
  • Y. Ganin et al., Domain-adversarial training of neural networks, J. Mach. Learn. Res., 2016.
  • M. Gong et al., Domain adaptation with conditional transferable components, International Conference on Machine Learning, 2016.
  • D. Li et al., Learning to generalize: meta-learning for domain generalization, AAAI Conference on Artificial Intelligence, 2018.

    Sentao Chen received the Ph.D. degree in software engineering from South China University of Technology, Guangzhou, China, in 2020. He is currently a Lecturer with the Department of Computer Science, Shantou University, Shantou, China. His research interests include statistical machine learning, domain adaptation, and domain generalization.

    Lei Wang received his Ph.D. degree from Nanyang Technological University, Singapore in 2004. Now he works as associate professor at School of Computing and Information Technology of University of Wollongong, Australia. His research interests include pattern recognition, machine/deep learning, computer vision and image retrieval.

    Zijie Hong received the B.S. degree in software engineering from South China University of Technology, Guangzhou, China, in 2019. He is currently pursuing the M.Sc. degree in software engineering in the School of Software Engineering, South China University of Technology. His research interests include domain adaptation and computer vision.

    Xiaowei Yang received the B.S. degree in theoretical and applied mechanics, the M.Sc. degree in computational mechanics, and the Ph.D. degree in solid mechanics from Jilin University, Changchun, China, in 1991, 1996, and 2000, respectively. He is currently a full-time professor in the School of Software Engineering, South China University of Technology. His current research interests include designs and analyses of algorithms for large-scale pattern recognition, imbalanced learning, semisupervised learning, support vector machines, tensor learning, and evolutionary computation.
