Discrepant collaborative training by Sinkhorn divergences

https://doi.org/10.1016/j.imavis.2021.104213

Highlights

  • An improved Co-Training framework obtained by adding a Sinkhorn divergence module to it

  • An empirical study of the effectiveness of well-known variants of the Sinkhorn divergence

  • An extensive set of experiments demonstrating the advantage of our module

Abstract

Deep Co-Training algorithms typically comprise two distinct and diverse feature extractors that simultaneously attempt to learn task-specific features from the same inputs. Achieving such an objective is, however, not trivial, despite its seemingly simple formulation, because homogeneous networks tend to mimic each other under the collaborative training setup. Keeping this difficulty in mind, we make use of the recently proposed S divergence to encourage diversity between homogeneous networks. The S divergence encapsulates popular measures such as the maximum mean discrepancy and the Wasserstein distance under the same umbrella and provides us with a principled, yet simple and straightforward, mechanism. Our empirical results in two domains, classification in the presence of noisy labels and semi-supervised image classification, clearly demonstrate the benefits of the proposed framework in learning distinct and diverse features. We show that in both settings we outperform the respective baselines by a notable margin.

Introduction

In this paper, we address the issue of learning diverse yet distinct models in a general Co-Training [1] (Co-Tr) framework. More specifically, we equip the Co-Tr learning paradigm with a novel discrepancy module that results in learning different yet complementary views of the input, thereby enhancing the overall discriminative ability of the entire Co-Tr module.

Designing diverse models is a long-standing problem in machine learning, despite several breakthroughs [2], [3], [4]. A prime example is the boosting algorithm and its variants, where a number of different weak classifiers are learned sequentially. In the era of deep learning, however, the capacity of boosting algorithms to produce strong classifiers has been vastly superseded, and more appropriate alternative methods to enforce diversity in a deep network need to be investigated.

To address the aforementioned need, what started as the simple learning of two conditionally independent views for classifying web data [1] has since been employed across a wide variety of tasks such as image classification [5], [6], text classification [4], email classification [7], and natural language processing [8]. In our preliminary study, we employed the Co-Tr framework with the maximum mean discrepancy (MMD) to learn from noisy labels and showed that the improved Co-Tr learns better from noisy labels. Co-Tr has also been used in a semi-supervised setting [6].

We stress that diversity is a requirement for the success of the Co-Tr framework [9]. The predominant techniques to induce diversity in Co-Tr are (a) the use of different network architectures [10], (b) random initialization schemes [11] for each of the individual networks, and (c) training each network with different sets of samples [5], [6]. While these tactics have achieved significant improvements in learning different but complementary views of the input, they suffer from a few setbacks. First, two or more networks with different architectures must be highly compatible with each other in order to jointly learn different yet complementary features from the same input data. Moreover, random initialization of the networks does not guarantee that the learnt features will indeed be diverse, distinct and informative enough for the task at hand. As a remedy, the recent work of Qiao et al. [6] utilized adversarial examples during training along with random initialization. Nevertheless, one cannot guarantee the learning of complementary features, as the distribution of the adversarial examples is very similar to that of the original training images. Interestingly, several recent works use identical, homogeneous networks as the feature-extractor units for co-training [5], [12]. Nonetheless, these methods do not explicitly enforce a discrepancy between the homogeneous networks and may thereby fail to capture inherent discriminative features across the different views.

In this paper, we address the above concern by explicitly enforcing a diversity constraint in the Co-Tr framework such that the underlying networks do learn different yet complementary views (or features) of the input, even though they share the same architecture and differ only in their initialization. The two networks f and g have the same structure (Fig. 1). They are fed the same batch of input data and are trained on the same task, image classification, but they are updated with different losses. Our proposed module (yellow boxes in Fig. 1) keeps the homogeneous networks different from each other and improves image classification accuracy. O1 and O2 are the prediction outputs of the two networks. At test time, we gather the outputs of networks f and g for each test image and select the larger (most confident) value as the final prediction. More details are given in Section 5. We apply the S divergence [13] to measure the diversity between the networks. One can turn S into the Maximum Mean Discrepancy (MMD) or an Optimal Transport (OT) distance simply by changing its entropic regularization parameter ε. The S module can be seamlessly added to any part of a Co-Tr framework, and it drives the networks to robustly learn from different views while keeping their outputs consistent at the same time. Fig. 1 provides a schematic of our proposed methodology.
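To make the pipeline concrete, the following is a minimal PyTorch sketch of the idea described above, not the authors' released implementation. It assumes the GeomLoss library of Feydy et al. [13] for the S divergence; the toy backbone SmallNet, the point at which features are tapped, the sign of the discrepancy term and the confidence-based test rule are illustrative assumptions based on the description around Fig. 1.

```python
# Minimal sketch of Discrepant Collaborative Training (DCT); illustrative only.
# Assumes the GeomLoss library (Feydy et al. [13]): pip install geomloss
import torch
import torch.nn as nn
import torch.nn.functional as F
from geomloss import SamplesLoss

# Sinkhorn divergence S_eps between batches of samples; "blur" plays the role of
# the entropic scale: a small blur behaves like an OT/Wasserstein distance, a
# large blur behaves like an MMD.
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

class SmallNet(nn.Module):
    """Toy homogeneous backbone standing in for both f and g (same structure)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        z = self.features(x)           # intermediate features (input to the S module)
        return z, self.classifier(z)   # features and logits (O1 / O2)

f, g = SmallNet(), SmallNet()          # identical architecture, different random init

def dct_losses(x, y):
    z_f, o_f = f(x)
    z_g, o_g = g(x)
    task = F.cross_entropy(o_f, y) + F.cross_entropy(o_g, y)       # task loss
    # Discrepancy in the middle of the networks: the S divergence between the two
    # feature batches is negated here so that minimising the total loss pushes the
    # features apart (one plausible sign convention; the paper's exact form may differ).
    discrepancy = -sinkhorn(z_f, z_g)
    # Consistency at the end: the S divergence between the predicted class
    # distributions is minimised so both networks still agree on the task.
    consistency = sinkhorn(F.softmax(o_f, dim=1), F.softmax(o_g, dim=1))
    return task, discrepancy, consistency

@torch.no_grad()
def predict(x):
    # Test rule described in the text: keep the prediction of the more confident network.
    p_f = F.softmax(f(x)[1], dim=1)
    p_g = F.softmax(g(x)[1], dim=1)
    f_wins = p_f.max(dim=1).values >= p_g.max(dim=1).values        # (batch,)
    return torch.where(f_wins.unsqueeze(1), p_f, p_g).argmax(dim=1)
```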

Our major contributions are as follows:

  • An explicit method to disentangle learnt representations via an S-based discrepancy module.

  • An empirical study of the effectiveness of two well-known variants of the S divergence, namely the Maximum Mean Discrepancy and the Wasserstein distance.

  • An extensive set of experiments demonstrating the advantage of the S module across two different settings: (i) image classification with noisy labels, and (ii) semi-supervised image classification.

Section snippets

Extension of previous work

We extend our previous work to a more general module by applying it to more tasks, datasets, and divergences. Our previous work can be considered a special case of the work proposed here. Our theoretical contribution is to formulate the problem using the Sinkhorn divergence, a family of divergences of which the MMD and the Wasserstein distance are instances. Our new work thus covers a more general notion of distance measures under the family of Sinkhorn divergences.
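For reference, a short LaTeX note on the Sinkhorn divergence as defined in the works cited here (Genevay et al.; Feydy et al., referenced as [13] in the text); the notation (ground cost C, entropic weight ε) follows those works, and the limiting behaviour below is the property that lets the same module recover either the MMD or the Wasserstein distance.

```latex
% Entropy-regularised optimal transport with ground cost C and regulariser eps:
\[
\mathrm{OT}_{\varepsilon}(\alpha,\beta)
  \;=\; \min_{\pi \in \Pi(\alpha,\beta)}
        \int C(x,y)\,\mathrm{d}\pi(x,y)
        \;+\; \varepsilon\,\mathrm{KL}\!\left(\pi \,\Vert\, \alpha \otimes \beta\right).
\]
% Debiased Sinkhorn divergence:
\[
S_{\varepsilon}(\alpha,\beta)
  \;=\; \mathrm{OT}_{\varepsilon}(\alpha,\beta)
  \;-\; \tfrac{1}{2}\,\mathrm{OT}_{\varepsilon}(\alpha,\alpha)
  \;-\; \tfrac{1}{2}\,\mathrm{OT}_{\varepsilon}(\beta,\beta).
\]
% As eps -> 0, S_eps recovers the unregularised OT (Wasserstein) cost;
% as eps -> infinity, it converges to the MMD induced by the kernel -C.
```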

Co-training

Blum et al. [1] successfully used a Co-Tr framework to solve the problem of web page classification. The Co-Tr algorithm learns two (or more) distinct and diverse models. It has been applied to a large variety of tasks, ranging from domain adaptation [16] and image classification [6] to data segmentation [17], tag-based image search [18], and many more. Similar to the general framework of learning in Co-Tr, Learning to Teach [19] applies an inherent feedback-sharing mechanism between a

Preliminaries

Notation. All notation used is shown in Table 1. Throughout this paper, we use bold lower-case letters (e.g., x) and bold upper-case letters (e.g., X) to represent column vectors and matrices, respectively. [x]_i denotes the i-th element of the vector x. I_n represents the n × n identity matrix. ∥X∥_F = √(Tr(X^⊤X)) represents the Frobenius norm of the matrix X, with Tr(⋅) indicating the trace of the matrix X. X^⊤ denotes the transpose of X. P_U and P_V represent the sets of probability measures

Methodology

Here, we present our proposed methodology, Discrepant Collaborative Training (DCT). We formulate DCT as a cohort of two homogeneous networks f and g with learnable parameters θ and θ̂, respectively (see Fig. 1 for more details). Our total loss consists of three parts: L_M, L_D and L_C. L_M is the basic task loss (for example, the cross-entropy loss for image classification). L_D is a discrepancy loss that increases diversity. L_C is the consistency loss.
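Although Eq. (19) is not reproduced in this snippet, a plausible form of the combined objective, consistent with the description above and with the hyper-parameters λ2 and λ3 discussed in the analysis section, is sketched below; the exact weighting and sign conventions of the paper may differ.

```latex
% A plausible form of the DCT objective (illustrative; the paper's Eq. (19) may differ):
\[
\mathcal{L}_{\mathrm{total}}
  \;=\; \mathcal{L}_{M}
  \;+\; \lambda_{2}\,\mathcal{L}_{D}
  \;+\; \lambda_{3}\,\mathcal{L}_{C},
\]
% where L_M is the task loss of both networks (e.g. cross entropy),
% L_D is an S-divergence discrepancy term on intermediate features of f and g
% (signed so that minimising the total loss increases their diversity), and
% L_C is an S-divergence consistency term on the two output distributions.
```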

Dataset

We verify the effectiveness of our approach on seven benchmark datasets: MNIST [35], CIFAR10 [36], CIFAR100 [36], SVHN [37], CUB200-2011 [38], CARS196 [39], and Clothing1M [15], across different tasks. Table 3 provides more details regarding the datasets.

Setup

Table 2 shows the CNN architecture used for the experiments on MNIST, CIFAR10, CIFAR100 and SVHN. For the experiments in the weakly-supervised learning setup, we follow the well-acknowledged standard settings of “Temporal Ensembling” [40] and “Virtual Adversarial Training”.

Further discussion and analysis

In this section, we provide details of our implementation and study the robustness of our DCT design with respect to its parameters, namely λ2 and λ3 (see Eq. (19) below). We first report the values of the hyper-parameters used in our experiments, and then analyze the effect of these hyper-parameters on the performance of the proposed DCT algorithm.

Conclusion

A novel and effective method (i.e., Co-Tr utilizing the S divergence) for training deep neural networks was presented. It was demonstrated to outperform all other baseline methods in the vast majority of settings affected by noisy labels, as well as in a semi-supervised setting. The S divergence was used in the middle of the networks to drive them to learn distinct features, while the S divergence at the end enforces the networks to still learn similar class probability distributions. Multi-label

Credit author statement

Each entry lists the author's name, highest academic degree, affiliation, and contribution.

1. Yan Han (corresponding), Master, Australian National University: Conceptualization; Methodology; Software; Validation; Investigation; Writing-Original Draft.
2. Soumava Kumar Roy, Master, Australian National University: Conceptualization; Methodology; Writing-Review & Editing.
3. Lars Petersson, Doctor, Commonwealth Scientific and Industrial Research Organization: Conceptualization; Methodology; Supervision; Writing-Review & Editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (50)

  • R.E. Banfield et al.

    Ensemble diversity measures and their application to thinning

    Inform. Fusion

    (2005)
  • A. Blum et al.

    Combining labeled and unlabeled data with co-training

  • R. Caruana

    Multitask learning

    Mach. Learn.

    (1997)
  • A. Argyriou et al.

    Multi-task feature learning

  • K. Nigam et al.

    Analyzing the effectiveness and applicability of co-training, in: CIKM

    (2000)
  • B. Han et al.

    Co-teaching: robust training of deep neural networks with extremely noisy labels

  • S. Qiao et al.

    Deep Co-Training for semi-supervised image recognition

  • S. Kiritchenko et al.

    Email classification with co-training

  • Y. Sun et al.

    Co-training an improved recurrent neural network with probability statistic models for named entity recognition

  • Y. Zhang et al.

    Deep mutual learning

  • J.C. Spall

    Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control

    (2005)
  • E. Malach et al.

    Decoupling “when to update” from “how to update”

  • A. Genevay et al.

    Learning generative models with sinkhorn divergences

  • Z. Liu et al.

    Deepfashion: powering robust clothes recognition and retrieval with rich annotations

  • J. Deng et al.

    Scalable multi-label annotation

  • S. Chen et al.

    Multi-task Attention-based Semi-supervised Learning for Medical Image Segmentation

  • Y. Chai et al.

    Bicos: a bi-level co-segmentation method for image classification

  • Y. Gong et al.

    A multi-view embedding space for modeling internet images, tags, and their semantics

    Int. J. Comp. Vision

    (2014)
  • G. Hinton et al.

    Distilling the Knowledge in a Neural Network

  • D.P. Kingma et al.

    Auto-encoding Variational Bayes

  • I. Goodfellow et al.

    Generative adversarial nets

  • S. Kullback et al.

    On information and sufficiency

    Ann. Math. Stat.

    (1951)
  • C.D. Manning et al.

    Foundations of Statistical Natural Language Processing

    (1999)
  • J. Feydy et al.

    Interpolating between Optimal Transport and MMD using Sinkhorn Divergences
