DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks

Cheng, Jianlin; Sweredoski, Michael J.; Baldi, Pierre

doi:10.1007/s10618-005-0023-5

Published: 11 May 2006

Data Mining and Knowledge Discovery, volume 13, pages 1–10 (2006)

Abstract

Protein domains are the structural and functional units of proteins. The ability to parse protein chains into domains is important for protein classification and for understanding protein structure, function, and evolution. Here we use machine learning algorithms, in the form of recursive neural networks, to develop a protein domain predictor called DOMpro. DOMpro predicts protein domains using a combination of evolutionary information in the form of profiles, predicted secondary structure, and predicted relative solvent accessibility. DOMpro is trained and tested on a curated dataset derived from the CATH database. DOMpro correctly predicts the number of domains for 69% of the combined dataset of single- and multi-domain chains. It achieves a sensitivity of 76% and a specificity of 85% on single-domain proteins, and a sensitivity of 59% and a specificity of 38% on two-domain proteins. DOMpro also achieved a sensitivity of 71% and a specificity of 71% in the Critical Assessment of Fully Automated Structure Prediction 4 (CAFASP-4) (Fischer et al., 1999; Saini and Fischer, 2005) and was ranked among the top ab initio domain predictors. The DOMpro server, software, and dataset are available at http://www.igb.uci.edu/servers/psss.html.
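As a reading aid for the figures quoted above, the following minimal sketch shows one way per-class sensitivity and specificity for domain-number prediction could be computed. It is not taken from the paper. It assumes the convention, common in structure-prediction work of that period, that specificity means TP / (TP + FP), i.e. the fraction of chains predicted to have k domains that truly have k domains; this abstract does not state the definition. The function names and the toy data are hypothetical.

```python
def domain_count_metrics(true_counts, predicted_counts, k):
    """Sensitivity and specificity for the class "chain has k domains".

    ASSUMPTION (not stated in the abstract): specificity is TP / (TP + FP),
    which is called precision in other fields.
    """
    pairs = list(zip(true_counts, predicted_counts))
    tp = sum(1 for t, p in pairs if t == k and p == k)  # true k, predicted k
    fn = sum(1 for t, p in pairs if t == k and p != k)  # true k, predicted other
    fp = sum(1 for t, p in pairs if t != k and p == k)  # true other, predicted k
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, specificity


def fraction_correct(true_counts, predicted_counts):
    """Fraction of chains whose predicted domain number equals the true one."""
    pairs = list(zip(true_counts, predicted_counts))
    return sum(1 for t, p in pairs if t == p) / len(pairs)


# Hypothetical toy data: true and predicted domain numbers for eight chains.
true_counts = [1, 1, 1, 1, 2, 2, 2, 3]
predicted_counts = [1, 1, 1, 2, 2, 2, 1, 2]

single = domain_count_metrics(true_counts, predicted_counts, k=1)
double = domain_count_metrics(true_counts, predicted_counts, k=2)
overall = fraction_correct(true_counts, predicted_counts)
```

Under this convention, a class can have reasonable sensitivity but low specificity when many chains from other classes are wrongly assigned to it, which is the pattern reported for two-domain proteins.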

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.
Baldi, P. and Pollastri, G. 2003. The principled design of large-scale recursive neural network architectures-DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research, 4:575–602.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Research, 28:235–242.
Bryson, K., McGuffin, L.J., Marsden, R.L., Ward, J.J., Sodhi, J.S., and Jones, D.T. 2005. Protein structure prediction servers at University College London. Nucleic Acids Research, 33:W36–W38.
Cheng, J., Randall, A.Z., Sweredoski, M.J., and Baldi, P. 2005a. SCRATCH: A protein structure and structural feature prediction server. Nucleic Acids Research, 33:W72–W76.
Cheng, J., Sweredoski, M.J., and Baldi, P. 2005b. Accurate prediction of protein disordered regions by mining protein structure data. Data Mining and Knowledge Discovery, in press.
Chivian, D., Kim, D.E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl, C.A., and Baker, D. 2003. Automated prediction of CASP-5 structures using the Robetta server. Proteins, 53(S6):524–533.
Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A., Jones, D., Karplus, K.J., Kelley, L.A., MacCallum, R.M., Pawlowski, K., Rost, B., Rychlewski, L., and Sternberg, M. 1999. CAFASP-1: Critical assessment of fully automated structure prediction methods. Proteins, Suppl 3:209–217.
George, R.A. and Heringa, J. 2002. SnapDRAGON: A method to delineate protein structural domains from sequence data. Journal of Molecular Biology, 316:839–851.
Gewehr, J.E. and Zimmer, R. 2005. SSEP-Domain: Protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics, in press.
Heger, A. and Holm, L. 2003. Exhaustive enumeration of protein domain families. Journal of Molecular Biology, 328:749–767.
Holm, L. and Sander, C. 1994. Parser for protein folding units. Proteins, 19:256–268.
Holm, L. and Sander, C. 1998a. Dictionary of recurrent domains in protein structures. Proteins, 33:88–96.
Holm, L. and Sander, C. 1998b. Touring protein fold space with Dali/FSSP. Nucleic Acids Research, 26:316–319.
Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292:195–202.
Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637.
Levitt, M. and Chothia, C. 1976. Structural patterns in globular proteins. Nature, 261(5561):552–558.
Lexa, M. and Valle, G. 2003. PRIMEX: Rapid identification of oligonucleotide matches in whole genomes. Bioinformatics, 19:2486–2488.
Linding, R., Russell, R.B., Neduva, V., and Gibson, T.J. 2003. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Research, 31:3701–3708.
Liu, J. and Rost, B. 2004. Sequence-based prediction of protein domains. Nucleic Acids Research, 32(12):3522–3530.
Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., Liebert, C.A., Liu, C., Madej, T., Marchler, G.H., Mazumder, R., Nikolskaya, A.N., Panchenko, A.R., Rao, B.S., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Vasudevan, S., Wang, Y., Yamashita, R.A., Yin, J.J., and Bryant, S.H. 2003. CDD: A curated Entrez database of conserved domain alignments. Nucleic Acids Research, 31(1):383–387.
Marsden, R.L., McGuffin, L.J., and Jones, D.T. 2002. Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science, 11:2814–2824.
Mika, S. and Rost, B. 2003. UniqueProt: Creating representative protein sequence sets. Nucleic Acids Research, 31(13):3789–3791.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540.
Nagarajan, N. and Yona, G. 2004. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics, 20:1335–1360.
Orengo, C.A., Bray, J.E., Buchan, D.W., Harrison, A., Lee, D., Pearl, F.M., Sillitoe, I., Todd, A.E., and Thornton, J.M. 2002. The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics, 2:11–21.
Pollastri, G., Baldi, P., Fariselli, P., and Casadio, R. 2002. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142–153.
Pollastri, G. and Baldi, P. 2002. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics, 18(Suppl 1):S62–S70. Proceedings of the ISMB 2002 Conference.
Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. 2001. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235.
Przybylski, D. and Rost, B. 2002. Alignments grow, secondary structure prediction improves. Proteins, 46:197–205.
Saini, H.K. and Fischer, D. 2005. Meta-DP: Domain prediction meta server. Bioinformatics, 21:2917–2920.
von Ohsen, N., Sommer, I., Zimmer, R., and Lengauer, T. 2004. Arby: Automatic protein structure prediction using profile-profile alignment and confidence measures. Bioinformatics, 20:2228–2235.
Wheelan, S.J., Marchler-Bauer, A., and Bryant, S.H. 2000. Domain size distributions can predict domain boundaries. Bioinformatics, 16(7):613–618.
Zdobnov, E.M. and Apweiler, R. 2001. InterProScan: An integration platform for the signature-recognition methods in InterPro. Bioinformatics, 17:847–848.

Acknowledgments

This work is supported by the Institute for Genomics and Bioinformatics at UCI, a Laurel Wilkening Faculty Innovation award, an NIH Biomedical Informatics Training grant (LM-07443-01), an NSF MRI grant (EIA-0321390), a Sun Microsystems award, and a grant from the University of California Systemwide Biotechnology Research and Education Program (UC BREP) to PB.

Author information

Authors and Affiliations

Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California Irvine, Irvine, CA, 92697, USA
Jianlin Cheng, Michael J. Sweredoski & Pierre Baldi

Corresponding author

Correspondence to Jianlin Cheng.

About this article

Cite this article

Cheng, J., Sweredoski, M.J. & Baldi, P. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Min Knowl Disc 13, 1–10 (2006). https://doi.org/10.1007/s10618-005-0023-5

Received: 18 May 2005
Accepted: 14 October 2005
Published: 11 May 2006
Issue Date: July 2006
DOI: https://doi.org/10.1007/s10618-005-0023-5
