Domain Adaptation with Logistic Regression for the Task of Splice Site Prediction

Herndon, Nic; Caragea, Doina

doi:10.1007/978-3-319-19048-8_11

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9096)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use both the abundant labeled data from a source domain and the limited labeled data from the target domain to train classifiers in a domain adaptation setting. We propose such a classifier, based on logistic regression, and evaluate it for the task of splice site prediction – a difficult and essential step in gene prediction. Our classifier achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
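
The abstract does not spell out how the source and target data are combined, so the following is only a minimal sketch of one common scheme, not the authors' classifier: fit a logistic regression on the abundant source data, then fit the target model with a Gaussian prior centred on the source weights, so that the few target labels only pull the weights away from the source solution where the data support it. The synthetic data, the helper names (`fit_logreg`, `make_domain`), and the hyperparameters `lam`, `lr`, and `steps` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, prior_mean=None, lam=1.0, lr=0.1, steps=2000):
    """Gradient descent on mean log-loss + (lam / 2) * ||w - prior_mean||^2.

    With prior_mean=None this is ordinary ridge-penalised logistic
    regression; with prior_mean set to source weights, the penalty
    keeps the target model close to the source model.
    """
    n, d = X.shape
    if prior_mean is None:
        prior_mean = np.zeros(d)
    w = prior_mean.copy()
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / n + lam * (w - prior_mean)
        w -= lr * grad
    return w

def make_domain(rng, n, w_true):
    """Hypothetical synthetic domain: Gaussian features, Bernoulli labels."""
    X = rng.normal(size=(n, w_true.size))
    y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)
    return X, y

rng = np.random.default_rng(0)
w_source_true = np.array([2.0, -1.0, 0.5])
w_target_true = np.array([1.5, -1.0, 1.0])   # related but shifted domain

Xs, ys = make_domain(rng, 5000, w_source_true)  # abundant labeled source data
Xt, yt = make_domain(rng, 30, w_target_true)    # limited labeled target data

# Step 1: source model, weakly regularised toward zero.
w_src = fit_logreg(Xs, ys, lam=0.01)

# Step 2: target model, regularised toward the source weights.
w_adapted = fit_logreg(Xt, yt, prior_mean=w_src, lam=0.5)

print("source weights: ", w_src)
print("adapted weights:", w_adapted)
```

In this sketch `lam` controls how strongly the target model trusts the source domain: a large value keeps the weights near the source solution, while a small value lets the target labels dominate. In practice it would be tuned on held-out target data, for example by area under the precision-recall curve, the metric reported in the abstract.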

Author information

Authors and Affiliations

Computing and Information Sciences, Kansas State University, 234 Nichols Hall, Manhattan, KS, 66506, USA
Nic Herndon & Doina Caragea

Corresponding author

Correspondence to Nic Herndon.

Editor information

Editors and Affiliations

Georgia State University, Atlanta, USA
Robert Harrison
Old Dominion University, Norfolk, USA
Yaohang Li
University of Connecticut, Storrs, Connecticut, USA
Ion Măndoiu

About this paper

Cite this paper

Herndon, N., Caragea, D. (2015). Domain Adaptation with Logistic Regression for the Task of Splice Site Prediction. In: Harrison, R., Li, Y., Măndoiu, I. (eds) Bioinformatics Research and Applications. ISBRA 2015. Lecture Notes in Computer Science, vol 9096. Springer, Cham. https://doi.org/10.1007/978-3-319-19048-8_11

DOI: https://doi.org/10.1007/978-3-319-19048-8_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19047-1
Online ISBN: 978-3-319-19048-8
eBook Packages: Computer Science, Computer Science (R0)
