Abstract
We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider the problem setting motivated by the tasks of splice site prediction and protein localization. For example, annotating a genome with machine learning requires a large amount of labeled data for splice site prediction, whereas for non-model organisms only a small amount of labeled data and a large amount of unlabeled data are available. With domain adaptation, one can leverage the large amount of data from a related model organism, along with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three approaches to incorporating the unlabeled data: with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels. We analyze them for splice site prediction and protein localization in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction and protein localization indicating that using a combination of soft and hard labels performs as well as the best of the other two approaches to integrating unlabeled data.
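The three labeling strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The classifier (a two-class Bernoulli naive Bayes), the confidence `threshold`, the iteration count, and all function names are assumptions made for illustration. Source-domain data and its weighting are omitted, and the "hard" mode re-labels the unlabeled set on every iteration, which simplifies typical self-training.

```python
import math

def fit_nb(X, W, n_features, alpha=1.0):
    """Fit a two-class Bernoulli naive Bayes model from weighted examples.

    X: list of binary feature vectors.
    W: list of (w0, w1) pairs, the weight of each example for class 0 and 1.
       A hard label is (1, 0) or (0, 1); a soft label is a posterior.
    """
    mass = [alpha, alpha]  # smoothed total weight per class
    counts = [[alpha] * n_features for _ in range(2)]
    for x, w in zip(X, W):
        for c in range(2):
            mass[c] += w[c]
            for j, v in enumerate(x):
                counts[c][j] += w[c] * v
    total = mass[0] + mass[1]
    prior = [m / total for m in mass]
    # theta[c][j] estimates P(feature j = 1 | class c) with Laplace smoothing
    theta = [[counts[c][j] / (mass[c] + alpha) for j in range(n_features)]
             for c in range(2)]
    return prior, theta

def posterior(model, x):
    """Return (P(class 0 | x), P(class 1 | x)) computed in log space."""
    prior, theta = model
    logp = []
    for c in range(2):
        s = math.log(prior[c])
        for j, v in enumerate(x):
            s += math.log(theta[c][j] if v else 1.0 - theta[c][j])
        logp.append(s)
    m = max(logp)
    e = [math.exp(s - m) for s in logp]
    z = e[0] + e[1]
    return (e[0] / z, e[1] / z)

def label_unlabeled(model, U, mode, threshold=0.8):
    """Assign class weights to unlabeled examples under one strategy."""
    weights = []
    for x in U:
        p = posterior(model, x)
        confident = max(p) >= threshold
        hard = (1.0, 0.0) if p[0] >= p[1] else (0.0, 1.0)
        if mode == "soft":      # EM-style: every example, fractional labels
            weights.append(p)
        elif mode == "hard":    # self-training-style: confident examples only
            weights.append(hard if confident else (0.0, 0.0))
        elif mode == "both":    # hard labels when confident, soft otherwise
            weights.append(hard if confident else p)
        else:
            raise ValueError("unknown mode: " + mode)
    return weights

def train(XL, yL, U, mode, iters=10):
    """Iteratively refit on labeled data plus (re)labeled unlabeled data."""
    n = len(XL[0])
    WL = [(1.0, 0.0) if y == 0 else (0.0, 1.0) for y in yL]
    model = fit_nb(XL, WL, n)
    for _ in range(iters):
        WU = label_unlabeled(model, U, mode)
        model = fit_nb(XL + U, WL + WU, n)
    return model
```

Calling `train(XL, yL, U, "both")` on small binary feature vectors returns a `(prior, theta)` pair that `posterior` can score. Representing every label as a pair of class weights lets a single fitting routine serve all three strategies, so they differ only in how `label_unlabeled` fills in the weights.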





Notes
When checking for convergence, we assigned hard labels to all instances in the target unlabeled data set.
Downloaded from http://ftp.raetschlab.org/user/cwidmer.
Downloaded from http://www.psort.org/dataset/datasetv2.html.
Downloaded from http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.php.
Acknowledgments
This work was supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under Grant Number P20GM103418. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by Grants MRI-1126709, CC-NIE-1341026, MRI-1429316, and CC-IIE-1440548.
Cite this article
Herndon, N., Caragea, D. An evaluation of approaches for using unlabeled data with domain adaptation. Netw Model Anal Health Inform Bioinforma 5, 25 (2016). https://doi.org/10.1007/s13721-016-0133-6
DOI: https://doi.org/10.1007/s13721-016-0133-6