
Pattern Recognition Letters

Volume 37, 1 February 2014, Pages 15-23

Unlabeling data can improve classification accuracy

https://doi.org/10.1016/j.patrec.2013.03.027

Highlights

  • Early classification in clinical studies is possible with transductive learners.

  • Removal of labels can improve classification accuracy.

  • Algorithms are misled to different degrees by correctly labeled data.

  • The data itself, not the percentage of labeled samples, has the strongest influence on prediction.

Abstract

In this study we focus on the effects of sample limitations on partially supervised learning algorithms. We analyze the performance of these types of learning algorithms on small datasets under varying trade-offs between labeled and unlabeled samples. In contrast to the typical settings for partially supervised learning algorithms, the number of available unlabeled samples is also restricted.

We utilize gene expression datasets, which are typical examples of data collections of small sample size. DNA microarrays are used to generate these profiles by measuring thousands of mRNA values simultaneously. These profiles are increasingly used for tumor categorization. Partially labeled microarray datasets occur naturally in the diagnostic setting if the corresponding labeling process is time-consuming or expensive (e.g., “early relapse” vs. “late relapse”).

Surprisingly, the best classification results in our study were not always achieved for a maximal proportion of labeled samples. This is unexpected, as asymptotic results for an unlimited amount of samples suggest that a labeled sample is of exponentially higher value than an unlabeled one. Our analysis shows that in the case of finite sample sizes a more balanced trade-off between labeled and unlabeled samples is optimal. This trade-off was not unique over all experiments: the optimal trade-off between unlabeled and labeled samples depends mainly on the chosen learning algorithm.

Introduction

In modern clinical studies the progress of a disease is often monitored by gene expression profiles extracted with the help of DNA microarrays (e.g., West et al., 2001, Shipp et al., 2002, Bittner et al., 2000). These so-called high-throughput methods can be used to extract thousands of gene expression levels simultaneously. Gene expression profiles can be used as a decision basis for the categorization into clinically relevant groups (e.g., “inflammation” vs. “tumor”, Buchholz et al., 2005). In this setting a data collection seldom exceeds a few dozen samples. A classifier utilized for this task has to handle data of high dimensionality and low cardinality. One standard learning scheme applied in this scenario is the supervised one, which trains a classifier solely on categorized samples (e.g., Duda et al., 2001). It is based on the implicit assumption that an adequate number of labeled training samples exists. Many clinically relevant classification tasks do not fulfill this assumption; here, gene expression profiles can be available years before the corresponding diagnoses. One example is the response to treatment, where it can be important to know whether a patient will sustain an “early relapse” or a “late relapse”. If the standard supervised classification scheme is used, the analysis of the available data can only start after the last diagnosis is available. Yet these diagnoses would often already be helpful in earlier stages of a study.

Partially supervised approaches, such as the semi-supervised learning scheme (Chapelle et al., 2006) or the transductive learning scheme (Vapnik and Chervonenkis, 1974), are able to handle partially labeled datasets. They incorporate information from both labeled and unlabeled samples, for example by utilizing the positional information of an unlabeled sample (Shu et al., 2009). Although partially supervised algorithms seem preferable in the setting described above, they are more commonly applied in fields with many more available observations (e.g., Joachims, 1999, Cohen et al., 2004). The performance of these algorithms on small (and possibly high-dimensional) datasets is largely unexplored.

In this study we investigate the usability of partially supervised algorithms for the classification of small (microarray) datasets. We analyze seven of these algorithms under different constraints on ten publicly available microarray datasets. Our experimental setup consists of sequences of cross-validation experiments that allow an analysis of partially supervised algorithms under varying trade-offs between labeled and unlabeled samples. The focus of our study is to identify “early prediction tools”: classifiers that can be used for sparsely labeled datasets. Prior results presented at the PSL 2011 workshop (Lausser et al., 2011) led to this investigation and the hypothesis given in the title.

The rest of this paper is structured as follows. Section 1.1 gives an overview of the related work in the field of partially supervised learning. Section 2 describes the experimental setup and gives the list of algorithms tested in our study (Section 2.4). The results of our study are given in Section 4 and are discussed in Section 5.

Incorporating information from unlabeled samples into the training of a classifier is not new. First approaches in this direction can be found in the 1960s in the context of self-training (Scudder, 1965). Today the field of partially supervised classification is mainly partitioned into two major classes of algorithms (see Chapelle et al., 2006, Seeger, 2000 for a good introduction). The first one, called semi-supervised classification, is often used as a general term for all partially supervised training algorithms (Chapelle et al., 2006). More specifically, it denotes algorithms that train “stand-alone” classification models which can be applied to new unseen samples. The second one, called transductive classification, was proposed by Vapnik and coauthors (Vapnik and Chervonenkis, 1974, Vapnik and Sterin, 1977); an overview of this field can be found in Vapnik (1998). The task of transductive learning is to incorporate the unlabeled query (test) samples directly into the training of a classifier.
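
To make the self-training idea concrete, the following is a minimal sketch in the spirit of Scudder (1965), not an implementation used in this study. It assumes scikit-learn-style estimators; the base learner, the confidence threshold and the {0, 1} label encoding are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Minimal self-training loop: repeatedly pseudo-label the most
    confident unlabeled samples and refit on the enlarged pool.
    Assumes binary labels encoded as 0/1."""
    model = LogisticRegression(max_iter=1000)
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        model.fit(X, y)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no sufficiently confident predictions left
        X = np.vstack([X, pool[confident]])
        # column index equals the label because model.classes_ == [0, 1]
        y = np.concatenate([y, proba[confident].argmax(axis=1)])
        pool = pool[~confident]
    return model
```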

Both types of classifiers are normally utilized in applications that are characterized by an expensive or time-consuming labeling process but allow an easy acquisition of unlabeled samples. These characteristics occur, for example, in the fields of text mining (e.g., Nigam et al., 1998, Joachims, 1999, Yarowsky, 1995) or image recognition (e.g., Cohen et al., 2004, Cai et al., 2007). Partially supervised algorithms have also been applied to large biological datasets (e.g., Weston et al., 2003, Shah et al., 2008).

Asymptotic results on the beneficial influence of unlabeled samples on the pattern recognition task were given by Castelli and Cover (1995, 1996). Their analysis is based on a setting in which the samples are drawn according to a mixture of identifiable class-conditional densities. They show that if an unlimited amount of unlabeled samples exists, a single labeled sample can be used to construct a classifier with a risk of 2R(1−R), where R is the Bayes risk (Castelli and Cover, 1995). Additional labeled samples reduce the risk of the classifier exponentially fast to the Bayes risk. They conclude that labeled samples are necessary and exponentially more valuable in this setting.
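
The 2R(1−R) expression admits a short heuristic derivation; the sketch below is our paraphrase of the argument, not a verbatim reproduction of Castelli and Cover (1995). The unlimited unlabeled data identify the two mixture components, so the single labeled sample only has to decide which component carries which label, and this association is correct exactly when the labeled sample would be classified correctly by the Bayes rule, i.e., with probability 1 − R:

```latex
% Heuristic risk of the one-labeled-sample classifier (our paraphrase):
\begin{align*}
R_1 &= \Pr(\text{association correct}) \cdot R
     + \Pr(\text{association wrong}) \cdot (1 - R) \\
    &= (1 - R)\,R + R\,(1 - R) \\
    &= 2R(1 - R).
\end{align*}
```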

Castelli and Cover (1996) analyze a similar setting in which the class-conditional densities are already identified and only the class priors are unknown. They show that in this case a classifier can be trained solely on unlabeled samples and that labeled and unlabeled samples diminish the difference between the risk of the classifier and the Bayes risk in a similar way.

In this work we explore the regime of small sample sizes and varying labeled-to-unlabeled ratios, i.e., finite-sample settings far from these asymptotic limits.

Section snippets

Methods

In this work a classifier c will be seen as a discriminative function $c:\mathcal{X}\rightarrow\mathcal{Y}$ mapping an input space $\mathcal{X}$ to the space of class labels $\mathcal{Y}$. We will focus on binary classification tasks; the space of class labels will be fixed to $\mathcal{Y}=\{0,1\}$. The joint probability distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$ is assumed to be fixed but unknown. In this setting the prediction of a class label should be achieved with a high accuracy. That is, the classifier should attain a small generalization risk
$$R_{\mathcal{D}}=\Pr\left(c(X)\neq Y\right).$$
Here $(X,Y)$ denotes a sample drawn according to $\mathcal{D}$.
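
For concreteness, the generalization risk can be estimated by the empirical error rate on held-out samples. A minimal sketch, assuming NumPy arrays and a classifier exposed as a plain function (names are illustrative):

```python
import numpy as np

def empirical_risk(c, X_test, y_test):
    """Estimate R_D = Pr(c(X) != Y) as the fraction of held-out
    samples whose predicted label differs from the true label."""
    return float(np.mean(c(X_test) != y_test))

# Example: a toy classifier c: X -> {0, 1} thresholding the first feature.
c = lambda X: (X[:, 0] > 0).astype(int)
```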

Experimental setup

The algorithms described above were analyzed on ten microarray datasets published in the years between 2000 and 2005. The dimensionality and cardinality of these datasets are given in Table 1.

The classifiers were tested in a sequence of cross-validation experiments. In this sequence the number of folds and, as explained before, the number of available labeled and unlabeled training samples is varied. The sequences started with cross-validation experiments with $k\in\{10,\ldots,2\}$ folds, which corresponds
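
A sketch of how such a sequence of labeled/unlabeled splits can be generated is given below. This is our reconstruction of the general idea, not the authors' exact protocol: for each fold count k, one fold is treated as labeled (a fraction of 1/k of the data) while the remaining samples enter the experiment without their labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def labeled_unlabeled_splits(y, ks=range(10, 1, -1), seed=0):
    """For each k in {10, ..., 2}, yield index sets in which one fold
    serves as the labeled part (1/k of the samples, i.e. 10%..50%)
    and the remaining folds are used unlabeled."""
    for k in ks:
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        # split() yields (rest, fold); the held-out fold is the labeled part
        for unlabeled_idx, labeled_idx in skf.split(np.zeros((len(y), 1)), y):
            yield k, labeled_idx, unlabeled_idx
```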

Classification results

The results of the cross-validation sequences are shown in Fig. 2. Accuracies, sensitivities and specificities are shown. Two of the tested algorithms, namely tsvm and plc, achieved a meaningful accuracy (higher than that of a naive classifier always predicting the larger class) on all datasets. The same is true for the tknn, except for its results in the 11.1% and 10% cross-validation experiments on SH. The sel and the ssnmc reach the accuracy of this naive approach only in some experiments on BI. Some tuples of
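
The naive baseline referred to above is simply the relative frequency of the larger class; a one-line sketch (assuming integer-encoded labels):

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of a classifier that always predicts the larger class."""
    return np.bincount(y).max() / len(y)
```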

Discussion

To our knowledge this is one of the few studies in partially supervised learning (we are not aware of any others) that deal with small datasets. For these datasets the sample size limitation applies not only to labeled samples but also to unlabeled ones. Our study showed that some partially supervised learners achieved non-trivial classification accuracies in all our experiments. As expected, better results were mostly gained in the cross-validation settings designed to use more

Acknowledgments

This work was funded in part by a Karl-Steinbuch grant to FS, the German Federal Ministry of Education and Research (BMBF) within the framework of medical genome research (PaCa-Net; project ID PKB-01GS08) and Gerontosys II (Forschungskern SyStaR, project ID 0315894A) to HAK. This work was also supported by Deutsche Forschungsgemeinschaft (DFG) grant to MS (SCHM 2966/1-1) and HAK (SFB 1074, project Z1). The responsibility for the content lies exclusively with the authors.

References

  • Cai, D., He, X., Han, J., 2007. Semi-supervised discriminant analysis. In: IEEE 11th International Conference on...
  • Castelli, V., et al., 1996. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory.
  • Cohen, I., et al., 2004. Semisupervised learning of classifiers: theory, algorithms, and their application to human–computer interaction. PAMI.
  • Cormen, T., Leiserson, C., Rivest, R., Stein, C., 2009. Introduction to Algorithms, third ed., The MIT...
  • Duda, R., et al., 2001. Pattern Classification.
  • Fix, E., Hodges, J., 1951. Discriminatory analysis: nonparametric discrimination: consistency properties. Technical...
  • Hastie, T., et al., 2003. The Elements of Statistical Learning.

1 Contributed equally.
