We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider a problem setting motivated by the tasks of splice site prediction and protein localization. For example, annotating a genome with machine learning requires a large amount of labeled data for splice site prediction, whereas for non-model organisms only a small amount of labeled data and a large amount of unlabeled data are available. With domain adaptation, one can leverage the large amount of data from a related model organism, along with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three approaches to incorporating the unlabeled data: with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels. We study these approaches for splice site prediction and protein localization in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction and protein localization indicating that using a combination of soft and hard labels performs as well as the best of the other two approaches to integrating unlabeled data.

When checking for convergence, we assigned hard labels to all instances in the target unlabeled data set.
Downloaded from http://ftp.raetschlab.org/user/cwidmer.
Downloaded from http://www.psort.org/dataset/datasetv2.html.
Downloaded from http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.php.
This work was supported by an institutional development award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under Grant Number P20GM103418. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by Grants MRI-1126709, CC-NIE-1341026, MRI-1429316, and CCIIE-1440548.
