Predicting Protein Localization Using a Domain Adaptation Approach

Herndon, Nic; Caragea, Doina

doi:10.1007/978-3-662-44485-6_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 452)

Included in the following conference series:

International Joint Conference on Biomedical Engineering Systems and Technologies

Abstract

A challenge arising from the ever-increasing volume of biological data generated by next-generation sequencing technologies is the annotation of this data, e.g., identification of gene structure from the location of splice sites, or prediction of protein function or localization. The annotation can be achieved by using automated classification algorithms. Supervised classification requires large amounts of labeled data for the problem at hand. For many problems, labeled data is not available. However, labeled data might be available for a similar, related problem. To leverage the labeled data available for the related problem, we propose an algorithm that builds a naïve Bayes classifier for biological sequences in a domain adaptation setting. Specifically, it uses the existing large corpus of labeled data from a source organism, in conjunction with any available labeled data and a large amount of unlabeled data from a target organism, thus alleviating the need to manually label a large number of sequences for a supervised classifier. When tested on the task of predicting protein localization from the composition of the protein, this algorithm performed better than the multinomial naïve Bayes classifier. However, on the more difficult task of splice site prediction, the results were not satisfactory.
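
The abstract does not give the update rules of the proposed algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a multinomial naïve Bayes model over k-mer counts (k=1 is plain amino acid composition), a fixed down-weighting of source-organism examples, and EM-style soft labels for the unlabeled target sequences. The function names, the `source_weight` parameter, the toy sequences, and the class labels are all hypothetical.

```python
import math
from collections import Counter

def kmer_counts(seq, k=1):
    """Count overlapping k-mers; k=1 gives plain residue composition."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def fit(weighted_examples, classes, vocab, alpha=1.0):
    """Fit multinomial naive Bayes from (features, {class: weight}) pairs.

    alpha is the Laplace smoothing constant. Returns (log_prior, log_like).
    """
    class_mass = {c: alpha for c in classes}
    feat_mass = {c: {f: alpha for f in vocab} for c in classes}
    for feats, weights in weighted_examples:
        for c, w in weights.items():
            class_mass[c] += w
            for f, n in feats.items():
                feat_mass[c][f] += w * n
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        z = sum(feat_mass[c].values())
        log_like[c] = {f: math.log(feat_mass[c][f] / z) for f in vocab}
    return log_prior, log_like

def posterior(feats, model, classes):
    """Class posterior for one sequence; features unseen in training are skipped."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c]
        + sum(n * log_like[c][f] for f, n in feats.items() if f in log_like[c])
        for c in classes
    }
    top = max(scores.values())
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

def adapt(source, target_labeled, target_unlabeled,
          source_weight=0.5, iterations=5, k=1):
    """Assumed scheme: weighted source + target labels, EM over unlabeled target."""
    classes = sorted({y for _, y in source} | {y for _, y in target_labeled})
    src = [(kmer_counts(s, k), {y: source_weight}) for s, y in source]
    tgt = [(kmer_counts(s, k), {y: 1.0}) for s, y in target_labeled]
    unl = [kmer_counts(s, k) for s in target_unlabeled]
    vocab = set()
    for feats in [f for f, _ in src + tgt] + unl:
        vocab.update(feats)
    model = fit(src + tgt, classes, vocab)  # start from labeled data only
    for _ in range(iterations):
        # E-step: soft-label the unlabeled target sequences.
        soft = [(f, posterior(f, model, classes)) for f in unl]
        # M-step: refit on labeled plus soft-labeled data.
        model = fit(src + tgt + soft, classes, vocab)
    return model, classes

def predict(seq, model, classes, k=1):
    post = posterior(kmer_counts(seq, k), model, classes)
    return max(post, key=post.get)

# Toy data, invented purely for illustration.
source = [("KKRKKRKK", "nuclear"), ("LLALLVLL", "membrane")]
target_labeled = [("RKKRK", "nuclear"), ("LLVAL", "membrane")]
target_unlabeled = ["KRRKK", "ALLVL"]
model, classes = adapt(source, target_labeled, target_unlabeled)
print(predict("KRKRKK", model, classes))
```

Down-weighting the source examples is one simple way to let the target organism's data dominate as it accumulates; other schemes, such as changing the weight across iterations, would fit the same skeleton.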

Notes

1. Downloaded from http://www.psort.org/dataset/datasetv2.html
2. Downloaded from http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.php
3. Downloaded from ftp://ftp.tuebingen.mpg.de/fml/cwidmer/

Acknowledgements

The computing for this project was performed on the Beocat Research Cluster at Kansas State University, which is funded in part by NSF grants CNS-1006860, EPS-1006860, EPS-0919443, and MRI-1126709.

Author information

Authors and Affiliations

Kansas State University, Manhattan, KS, 66506, USA
Nic Herndon & Doina Caragea

Corresponding author

Correspondence to Nic Herndon.

Editor information

Editors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Mireya Fernández-Chimeno
Instituto Gulbenkian de Ciência, Oeiras, Portugal
Pedro L. Fernandes
Boston College, Chestnut Hill, Massachusetts, USA
Sergio Alvarez
University of Guelph, Guelph, Canada
Deborah Stacey
University of Vic, Vic, Spain
Jordi Solé-Casals
Technical University of Lisbon, Lisbon, Portugal
Ana Fred
New University of Lisbon, Lisboa, Portugal
Hugo Gamboa

About this paper

Cite this paper

Herndon, N., Caragea, D. (2014). Predicting Protein Localization Using a Domain Adaptation Approach. In: Fernández-Chimeno, M., et al. Biomedical Engineering Systems and Technologies. BIOSTEC 2013. Communications in Computer and Information Science, vol 452. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44485-6_14

DOI: https://doi.org/10.1007/978-3-662-44485-6_14
Published: 02 November 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44484-9
Online ISBN: 978-3-662-44485-6
eBook Packages: Computer Science, Computer Science (R0)
