Discrepant collaborative training by Sinkhorn divergences
Introduction
In this paper, we address the issue of learning diverse yet distinct models within a general Co-Training [1] (Co-Tr) framework. More specifically, we equip the Co-Tr learning paradigm with a novel discrepancy module that results in learning different yet complementary views of the input, thereby enhancing the overall discriminative ability of the entire Co-Tr module.
Designing diverse models is a long-standing problem in machine learning despite several breakthroughs [2], [3], [4]. The prime example is the boosting algorithm and its variants, in which a number of different weak classifiers are learned sequentially. However, in the era of deep learning, the capacity of boosting algorithms to produce strong classifiers has been vastly superseded, and more appropriate alternative methods to enforce diversity in a deep network need to be investigated.
To address the aforementioned need, what started as the simple learning of two conditionally independent views for classifying web data [1] has now been employed across a wide variety of tasks such as image classification [5], [6], text classification [4], email classification [7], and natural language processing [8]. In our preliminary study, we employed the Co-Tr framework with MMD to learn from noisy labels, and showed that the improved Co-Tr framework learns better from noisy labels. Co-Tr has also been used in a semi-supervised setting [6].
We stress that diversity is a requirement for the success of the Co-Tr framework [9]. The predominant techniques to induce diversity in Co-Tr are (a) the use of different network architectures [10], (b) random initialization schemes [11] in each of the individual networks, and (c) training each network with different sets of samples [5], [6]. While these tactics have achieved significant improvements in learning different but complementary views of the input, they suffer from a few setbacks. First, two or more networks with different architectures must be highly compatible with each other in order to jointly learn different yet complementary features from the same input data. Moreover, random initialization of the networks does not guarantee that the learnt features will indeed be diverse, distinct and informative enough for the task at hand. As a remedy, the recent work of Qiao et al. [6] utilized adversarial examples during training along with random initialization. Nevertheless, one cannot guarantee the learning of complementary features, as the distribution of the adversarial examples is very similar to that of the original training images. Interestingly, several recent works used identical homogeneous networks as the feature extractor units for co-training [5], [12]. Nonetheless, these methods do not explicitly enforce a discrepancy between the homogeneous networks and may thereby fail to capture inherent discriminative features across the different views.
In this paper, we address the above concern by explicitly enforcing a diversity constraint in the Co-Tr framework such that the underlying networks learn different yet complementary views (or features) of the input, even though they share the same architecture and differ only in their initialization. The two networks f and g have the same structure (Fig. 1). They are fed the same batch of input data and are trained on the same task: image classification. However, they are updated by different losses. Our proposed module (yellow boxes in Fig. 1) keeps the homogeneous networks different from each other and improves the image classification accuracy. O1 and O2 are the prediction outputs of the networks. At test time, we gather the outputs of networks f and g for each test image, and select the larger (most confident) value as the final prediction for that image. More details are given in Section 5. We apply the Sinkhorn divergence [13] to measure the diversity between the networks. One can recover the Maximum Mean Discrepancy (MMD) or the Optimal Transport (OT) distance simply by changing the entropic regularization parameter ε. The module can be seamlessly added to any part of a Co-Tr framework. It drives the networks to robustly learn from different views while keeping their outputs consistent. Fig. 1 provides a brief schematic of our proposed methodology.
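To make the divergence concrete, the following is a minimal NumPy sketch of the debiased Sinkhorn divergence between two batches of features, assuming uniform sample weights, a squared-Euclidean ground cost, and a fixed number of fixed-point iterations; the function names and these choices are our own illustration, not the paper's implementation:

```python
import numpy as np

def sinkhorn_cost(x, y, eps, n_iters=200):
    """Entropic-regularized OT cost between two point clouds
    with uniform weights, via Sinkhorn fixed-point iterations."""
    # Squared-Euclidean cost matrix between the two batches.
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / eps)                     # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))        # uniform source weights
    b = np.full(len(y), 1.0 / len(y))        # uniform target weights
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # transport plan
    return np.sum(P * C)

def sinkhorn_divergence(x, y, eps):
    """Debiased Sinkhorn divergence between two batches."""
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))
```

As ε shrinks, the quantity approaches the OT distance, while a large ε drives it toward an MMD-like quantity; the two debiasing terms make the divergence vanish when the batches coincide.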
Our major contributions are as follows:
- An explicit method to disentangle learnt representations via a Sinkhorn-divergence-based discrepancy module.
- An empirical study of the effectiveness of two well-known divergence measures, namely the Maximum Mean Discrepancy and the Wasserstein distance.
- An extensive set of experiments demonstrating the advantage of such a module across two different settings: (i) noisy-label image classification, and (ii) semi-supervised image classification.
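The MMD mentioned above admits a very compact empirical estimator. The sketch below is a NumPy illustration with a Gaussian kernel; the kernel choice, bandwidth, and biased estimator are our assumptions, not the paper's settings:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between two point clouds.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimate of the squared Maximum Mean Discrepancy:
    # distance between kernel mean embeddings of the two samples.
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())
```

The biased estimator is a squared RKHS norm, so it is non-negative and zero when the two samples coincide.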
Section snippets
Extension of previous work
We extend our previous work to a more general module by considering additional tasks, datasets and divergences; our previous work can be considered a special case of the work proposed here. Our theoretical contribution is to formulate the problem using the Sinkhorn divergence, a family of divergences of which MMD and the Wasserstein distance are instances. Our new work thus covers a more general notion of distance measures under the family of Sinkhorn divergences.
Co-training
Blum et al. [1] successfully used a Co-Tr framework to solve the problem of web page classification. The Co-Tr algorithm learns two (or more) distinct and diverse models. It has been applied to a large variety of tasks, ranging from domain adaptation [16] and image classification [6] to data segmentation [17], tag-based image search [18] and many more. Similar to the general framework of learning in Co-Tr, Learning to Teach [19] applies an inherent feedback-sharing mechanism between a teacher and a student model.
Preliminaries
Notation. All notations used are shown in Table 1. Throughout this paper, we use bold lower-case letters (e.g., x) and bold upper-case letters (e.g., X) to represent column vectors and matrices, respectively. [x]i denotes the ith element of the vector x. In represents the n × n identity matrix. ∥X∥F represents the Frobenius norm of the matrix X, and Tr(⋅) indicates the trace of a matrix. X⊤ denotes the transpose of X. We also use standard symbols (see Table 1) for the set of probability measures
Methodology
Here, we present our proposed methodology, i.e., Discrepant Collaborative Training (DCT). We formulate DCT as a cohort of two homogeneous networks f and g, each with its own set of learnable parameters (f's denoted θ; see Fig. 1 for more details). Our total loss comprises three parts: LM, LD and LC. LM is the basic loss function for the task at hand (for example, the cross-entropy loss for image classification). LD is the discrepancy loss that increases diversity. LC is the consistency loss.
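The combination of the three loss terms might be sketched as follows. This is a hypothetical NumPy illustration only: the negated-discrepancy formulation, the squared-error consistency penalty, and all function names (`cross_entropy`, `total_loss`, `lam2`, `lam3`) are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true class.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def total_loss(pf, pg, labels, feat_f, feat_g, discrepancy, lam2, lam3):
    """Hypothetical combination of the three DCT loss terms:
    task loss L_M on both networks' predictions, a (negated)
    discrepancy term L_D that pushes intermediate features apart,
    and a consistency term L_C that keeps outputs aligned."""
    L_M = cross_entropy(pf, labels) + cross_entropy(pg, labels)
    L_D = -discrepancy(feat_f, feat_g)   # maximize feature divergence (assumption)
    L_C = np.mean((pf - pg) ** 2)        # simple output-consistency penalty (assumption)
    return L_M + lam2 * L_D + lam3 * L_C
```

Any divergence (e.g., MMD or a Sinkhorn divergence) can be plugged in as `discrepancy`; the weights `lam2` and `lam3` play the role of the hyper-parameters λ2 and λ3 discussed later.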
Dataset
We verify the effectiveness of our approach on seven benchmark datasets: MNIST [35], CIFAR10 [36], CIFAR100 [36], SVHN [37], CUB200-2011 [38], CARS196 [39] and Clothing1M [15], on different tasks. Table 3 provides more details regarding the datasets.
Setup
Table 2 shows the CNN architecture used for the experiments on MNIST, CIFAR10, CIFAR100 and SVHN. For experiments in the weakly-supervised learning setup, we follow the well-acknowledged standard settings of “Temporal Ensembling” [40] and “Virtual Adversarial Training”.
Further discussion and analysis
In this section, we provide details of our implementation and study the robustness of our DCT design with respect to its parameters, namely λ2 and λ3 (see Eq. (19), below). We first report the values of the hyperparameters used in our experiments, and then analyze the effect of these hyperparameters on the performance of the proposed DCT algorithm.
Conclusion
A novel and effective method (i.e., Co-Tr utilizing Sinkhorn divergences) for training deep neural networks was presented. It was demonstrated to outperform all other baseline methods in the vast majority of settings influenced by noisy labels, as well as in a semi-supervised setting. The Sinkhorn divergence was used in the middle of the network to drive the networks to learn distinct features, while the consistency loss at the end enforces the networks to still learn similar class probability distributions. Multi-label
Credit author statement
1. Yan Han (corresponding), Master, Australian National University: Conceptualization; Methodology; Software; Validation; Investigation; Writing-Original Draft.
2. Soumava Kumar Roy, Master, Australian National University: Conceptualization; Methodology; Writing-Review & Editing.
3. Lars Petersson, Doctor, Commonwealth Scientific and Industrial Research Organization: Conceptualization; Methodology; Supervision; Writing-Review & Editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (50)
- et al., Ensemble diversity measures and their application to thinning, Inform. Fusion (2005)
- et al., Combining labeled and unlabeled data with co-training
- Multitask learning, Mach. Learn. (1997)
- et al., Multi-task feature learning
- et al., Analyzing the effectiveness and applicability of co-training, in: CIKM (2000)
- et al., Co-teaching: robust training of deep neural networks with extremely noisy labels
- et al., Deep Co-Training for semi-supervised image recognition
- et al., Email classification with co-training
- et al., Co-training an improved recurrent neural network with probability statistic models for named entity recognition
- et al., Deep mutual learning