An evaluation of approaches for using unlabeled data with domain adaptation

  • Original Article
Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider the problem setting motivated by the tasks of splice site prediction and protein localization. For example, annotating a genome with machine learning requires a large amount of labeled data, whereas for non-model organisms only a small amount of labeled data and a large amount of unlabeled data are available. With domain adaptation, one can leverage the large amount of data from a related model organism, along with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three approaches to incorporating the unlabeled data: with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels. We analyze them for splice site prediction and protein localization in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction and protein localization indicating that using a combination of soft and hard labels performs as well as the best of the other two approaches to integrating unlabeled data.
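To make the three labeling schemes concrete, the sketch below contrasts them on a toy problem. It is an illustration under stated assumptions, not the classifiers evaluated in the article: the model is a hypothetical two-class, one-dimensional Gaussian with unit variance, the source-domain data is omitted, the function names (`fit`, `posterior`, `weigh_unlabeled`, `adapt`) and the 0.9 confidence threshold are invented for the example, and hard labels are reassigned on every round rather than added permanently. The convergence check follows Note 1 in assigning hard labels to all unlabeled instances.

```python
import math

def fit(points):
    """Fit class priors and means from (x, w0, w1) tuples of class weights."""
    total = sum(w0 + w1 for _, w0, w1 in points)
    model = []
    for c in (0, 1):
        mass = sum(p[1 + c] for p in points)
        mean = sum(p[0] * p[1 + c] for p in points) / mass
        model.append((mass / total, mean))
    return model  # [(prior, mean)] per class; unit variance is assumed

def posterior(model, x):
    """Return P(class 1 | x) under the toy Gaussian model."""
    scores = [prior * math.exp(-0.5 * (x - mean) ** 2) for prior, mean in model]
    return scores[1] / (scores[0] + scores[1])

def weigh_unlabeled(model, xs, mode, threshold):
    """Turn unlabeled values into weighted instances for the chosen mode."""
    weighted = []
    for x in xs:
        p1 = posterior(model, x)
        confident = max(p1, 1 - p1) >= threshold
        if mode in ("hard", "both") and confident:
            # Hard label: full weight on the predicted class.
            weighted.append((x, 0.0, 1.0) if p1 >= 0.5 else (x, 1.0, 0.0))
        elif mode in ("soft", "both"):
            # Soft label: weight split according to the posterior.
            weighted.append((x, 1 - p1, p1))
        # In "hard" mode, unconfident instances are skipped this round.
    return weighted

def adapt(labeled, unlabeled, mode, threshold=0.9, max_iter=50):
    """Iterate until hard labels on all unlabeled instances stop changing."""
    base = [(x, float(y == 0), float(y == 1)) for x, y in labeled]
    model = fit(base)
    previous = None
    for _ in range(max_iter):
        model = fit(base + weigh_unlabeled(model, unlabeled, mode, threshold))
        hard = [posterior(model, x) >= 0.5 for x in unlabeled]
        if hard == previous:
            break
        previous = hard
    return model

# Toy data (made up for illustration): few labeled, more unlabeled values.
labeled = [(-2.0, 0), (-1.5, 0), (1.5, 1), (2.0, 1)]
unlabeled = [-2.5, -1.8, -0.2, 0.3, 1.7, 2.4]
for mode in ("soft", "hard", "both"):
    print(mode, adapt(labeled, unlabeled, mode))
```

The only difference between the three modes is how `weigh_unlabeled` treats instances below the threshold: the soft scheme weights every instance by its posterior, the hard scheme keeps only confident instances, and the combined scheme uses hard labels for confident instances and soft labels for the rest.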

Notes

  1. When checking for convergence we assigned hard labels to all instances from the target unlabeled data set.

  2. Downloaded from http://ftp.raetschlab.org/user/cwidmer.

  3. Downloaded from http://www.psort.org/dataset/datasetv2.html.

  4. Downloaded from http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.php.

Acknowledgments

This work was supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under Grant Number P20GM103418. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by Grants MRI-1126709, CC-NIE-1341026, MRI-1429316, CC-IIE-1440548.

Author information

Corresponding author

Correspondence to Nic Herndon.

About this article

Cite this article

Herndon, N., Caragea, D. An evaluation of approaches for using unlabeled data with domain adaptation. Netw Model Anal Health Inform Bioinforma 5, 25 (2016). https://doi.org/10.1007/s13721-016-0133-6

  • DOI: https://doi.org/10.1007/s13721-016-0133-6
