An evaluation of approaches for using unlabeled data with domain adaptation

Herndon, Nic; Caragea, Doina

doi:10.1007/s13721-016-0133-6

Original Article
Published: 07 July 2016

Volume 5, article number 25, (2016)

Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider the problem setting motivated by the tasks of splice site prediction and protein localization. For example, for splice site prediction, annotating a genome using machine learning requires a large amount of labeled data, whereas for non-model organisms only a small amount of labeled data is available alongside a large amount of unlabeled data. With domain adaptation, one can leverage the large amount of data from a related model organism, along with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three approaches to incorporating the unlabeled data: with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels. We analyze them for splice site prediction and protein localization in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction and protein localization indicating that using a combination of soft and hard labels performs as well as the best of the other two approaches to integrating unlabeled data.
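To make the distinction between the three labeling strategies concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a Gaussian naive Bayes classifier, a confidence threshold of 0.9 for hard labels, and a fixed number of iterations; all function names (`fit_weighted_nb`, `posterior`, `label_unlabeled`, `iterate`) are hypothetical. Source-domain data is left out for brevity; in a domain adaptation setting it could plausibly be stacked into the labeled pool with its own weight.

```python
import numpy as np

def fit_weighted_nb(X, R):
    """Fit Gaussian naive Bayes from features X (n, d) and per-instance
    class weights R (n, k). A zero row in R means the instance is ignored."""
    Nk = R.sum(axis=0) + 1e-9                    # effective count per class
    prior = Nk / Nk.sum()
    mean = (R.T @ X) / Nk[:, None]               # (k, d)
    var = (R.T @ X**2) / Nk[:, None] - mean**2 + 1e-3  # smoothed variance
    return prior, mean, var

def posterior(model, X):
    """Return class posteriors p(c | x) with shape (n, k)."""
    prior, mean, var = model
    diff2 = (X[:, None, :] - mean[None]) ** 2    # (n, k, d)
    ll = -0.5 * (np.log(2 * np.pi * var)[None] + diff2 / var[None]).sum(axis=2)
    logp = ll + np.log(prior)[None]
    logp -= logp.max(axis=1, keepdims=True)      # stabilize before exp
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def label_unlabeled(P, mode, threshold=0.9):
    """Turn posteriors P into training weights for the next iteration.
    'soft': keep posteriors (EM-style).
    'hard': one-hot for confident instances, zero weight otherwise
            (self-training-style).
    'both': one-hot for confident instances, posteriors for the rest."""
    hard = np.eye(P.shape[1])[P.argmax(axis=1)]
    confident = P.max(axis=1) >= threshold       # assumed confidence rule
    if mode == "soft":
        return P
    if mode == "hard":
        return hard * confident[:, None]
    if mode == "both":
        return np.where(confident[:, None], hard, P)
    raise ValueError(f"unknown mode: {mode}")

def iterate(X_lab, y_lab, X_unl, mode, n_classes=2, n_iter=10):
    """Alternate between labeling the unlabeled pool and refitting."""
    R_lab = np.eye(n_classes)[y_lab]             # labeled data stays one-hot
    model = fit_weighted_nb(X_lab, R_lab)
    for _ in range(n_iter):
        R_unl = label_unlabeled(posterior(model, X_unl), mode)
        model = fit_weighted_nb(np.vstack([X_lab, X_unl]),
                                np.vstack([R_lab, R_unl]))
    return model
```

The only difference between the three variants in this sketch is the weight matrix that `label_unlabeled` returns, which makes it easy to compare them on the same data by changing the `mode` argument.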

Notes

When checking for convergence we assigned hard labels to all instances from the target unlabeled data set.
Downloaded from http://ftp.raetschlab.org/user/cwidmer.
Downloaded from http://www.psort.org/dataset/datasetv2.html.
Downloaded from http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.php.

Acknowledgments

This work was supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under Grant Number P20GM103418. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by Grants MRI-1126709, CC-NIE-1341026, MRI-1429316, CC-IIE-1440548.

Author information

Authors and Affiliations

Kansas State University, Manhattan, USA
Nic Herndon & Doina Caragea

Authors

Nic Herndon
Doina Caragea

Corresponding author

Correspondence to Nic Herndon.

About this article

Cite this article

Herndon, N., Caragea, D. An evaluation of approaches for using unlabeled data with domain adaptation. Netw Model Anal Health Inform Bioinforma 5, 25 (2016). https://doi.org/10.1007/s13721-016-0133-6

Received: 18 December 2015
Revised: 28 May 2016
Accepted: 25 June 2016
Published: 07 July 2016
DOI: https://doi.org/10.1007/s13721-016-0133-6
