DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks

Data Mining and Knowledge Discovery

Abstract

Protein domains are the structural and functional units of proteins. The ability to parse protein chains into different domains is important for protein classification and for understanding protein structure, function, and evolution. Here we use machine learning algorithms, in the form of recursive neural networks, to develop a protein domain predictor called DOMpro. DOMpro predicts protein domains using a combination of evolutionary information in the form of profiles, predicted secondary structure, and predicted relative solvent accessibility. DOMpro is trained and tested on a curated dataset derived from the CATH database. DOMpro correctly predicts the number of domains for 69% of the combined dataset of single- and multi-domain chains. DOMpro achieves a sensitivity of 76% and a specificity of 85% with respect to the single-domain proteins, and a sensitivity of 59% and a specificity of 38% with respect to the two-domain proteins. DOMpro also achieved a sensitivity and a specificity of 71% each in the Critical Assessment of Fully Automated Structure Prediction 4 (CAFASP-4) (Fischer et al., 1999; Saini and Fischer, 2005) and was ranked among the top ab initio domain predictors. The DOMpro server, software, and dataset are available at http://www.igb.uci.edu/servers/psss.html.
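The abstract reports per-class sensitivity and specificity for domain-number prediction without defining them. The sketch below shows one way such figures could be computed, assuming that sensitivity is the fraction of true N-domain chains predicted as N-domain, and that specificity is the fraction of predicted N-domain chains that truly have N domains. That convention is an assumption and is not confirmed by the text above. The function names and the toy data are hypothetical and are not taken from DOMpro.

```python
def domain_number_accuracy(true_counts, predicted_counts):
    """Fraction of chains whose predicted domain count equals the true count."""
    pairs = list(zip(true_counts, predicted_counts))
    return sum(1 for t, p in pairs if t == p) / len(pairs)


def domain_count_metrics(true_counts, predicted_counts, n_domains):
    """Sensitivity and specificity for the class 'chain has n_domains domains'.

    Assumed convention (not stated in the abstract):
      sensitivity = correct n-domain predictions / true n-domain chains
      specificity = correct n-domain predictions / predicted n-domain chains
    """
    pairs = list(zip(true_counts, predicted_counts))
    correct = sum(1 for t, p in pairs if t == n_domains and p == n_domains)
    n_true = sum(1 for t, _ in pairs if t == n_domains)
    n_pred = sum(1 for _, p in pairs if p == n_domains)
    sensitivity = correct / n_true if n_true else float("nan")
    specificity = correct / n_pred if n_pred else float("nan")
    return sensitivity, specificity


# Toy data, invented for illustration only (not DOMpro results).
true_counts = [1, 1, 1, 2, 2, 1, 2, 3]
predicted_counts = [1, 1, 2, 2, 1, 1, 2, 2]

accuracy = domain_number_accuracy(true_counts, predicted_counts)
single = domain_count_metrics(true_counts, predicted_counts, 1)
double = domain_count_metrics(true_counts, predicted_counts, 2)
print(accuracy, single, double)
```

Under this convention a predictor that over-calls two-domain chains can keep a moderate sensitivity while its specificity drops, which is one reading of the 59% and 38% figures reported for two-domain proteins.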

References

  • Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402.

  • Baldi, P. and Pollastri, G. 2003. The principled design of large-scale recursive neural network architectures-DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research, 4:575–602.

  • Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Research, 28:235–242.

  • Bryson, K., McGuffin, L.J., Marsden, R.L., Ward, J.J., Sodhi, J.S., and Jones, D.T. 2005. Protein structure prediction servers at University College London. Nucleic Acids Research, 33:W36–W38.

  • Cheng, J., Randall, A.Z., Sweredoski, M.J., and Baldi, P. 2005a. SCRATCH: A protein structure and structural feature prediction server. Nucleic Acids Research, 33:W72–W76.

  • Cheng, J., Sweredoski, M.J., and Baldi, P. 2005b. Accurate prediction of protein disordered regions by mining protein structure data. Data Mining and Knowledge Discovery, in press.

  • Chivian, D., Kim, D.E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl, C.A., and Baker, D. 2003. Automated prediction of CASP-5 structures using the Robetta server. Proteins, 53(S6):524–533.

  • Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A., Jones, D., Karplus, K.J., Kelley, L.A., MacCallum, R.M., Pawlowski, K., Rost, B., Rychlewski, L., and Sternberg, M. 1999. CAFASP-1: Critical assessment of fully automated structure prediction methods. Proteins, Suppl 3:209–217.

  • George, R.A. and Heringa, J. 2002. SnapDRAGON: A method to delineate protein structural domains from sequence data. Journal of Molecular Biology, 316:839–851.

  • Gewehr, J.E. and Zimmer, R. 2005. SSEP-Domain: Protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics, in press.

  • Heger, A. and Holm, L. 2003. Exhaustive enumeration of protein domain families. Journal of Molecular Biology, 328:749–767.

  • Holm, L. and Sander, C. 1994. Parser for protein folding units. Proteins, 19:256–268.

  • Holm, L. and Sander, C. 1998a. Dictionary of recurrent domains in protein structures. Proteins, 33:88–96.

  • Holm, L. and Sander, C. 1998b. Touring protein fold space with Dali/FSSP. Nucleic Acids Research, 26:316–319.

  • Jones, D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292:195–202.

  • Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637.

  • Levitt, M. and Chothia, C. 1976. Structural patterns in globular proteins. Nature, 261(5561):552–558.

  • Lexa, M. and Valle, G. 2003. PRIMEX: Rapid identification of oligonucleotide matches in whole genomes. Bioinformatics, 19:2486–2488.

  • Linding, R., Russell, R.B., Neduva, V., and Gibson, T.J. 2003. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Research, 31:3701–3708.

  • Liu, J. and Rost, B. 2004. Sequence-based prediction of protein domains. Nucleic Acids Research, 32(12):3522–3530.

  • Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., Liebert, C.A., Liu, C., Madej, T., Marchler, G.H., Mazumder, R., Nikolskaya, A.N., Panchenko, A.R., Rao, B.S., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Vasudevan, S., Wang, Y., Yamashita, R.A., Yin, J.J., and Bryant, S.H. 2003. CDD: A curated Entrez database of conserved domain alignments. Nucleic Acids Research, 31(1):383–387.

  • Marsden, R.L., McGuffin, L.J., and Jones, D.T. 2002. Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science, 11:2814–2824.

  • Mika, S. and Rost, B. 2003. UniqueProt: Creating representative protein sequence sets. Nucleic Acids Research, 31(13):3789–3791.

  • Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540.

  • Nagarajan, N. and Yona, G. 2004. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics, 20:1335–1360.

  • Orengo, C.A., Bray, J.E., Buchan, D.W., Harrison, A., Lee, D., Pearl, F.M., Sillitoe, I., Todd, A.E., and Thornton, J.M. 2002. The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics, 2:11–21.

  • Pollastri, G., Baldi, P., Fariselli, P., and Casadio, R. 2002. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142–153.

  • Pollastri, G. and Baldi, P. 2002. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics, 18(Suppl 1):S62–S70. Proceedings of the ISMB 2002 Conference.

  • Pollastri, G., Przybylski, D., Rost, B., and Baldi, P. 2001. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235.

  • Przybylski, D. and Rost, B. 2002. Alignments grow, secondary structure prediction improves. Proteins, 46:197–205.

  • Saini, H.K. and Fischer, D. 2005. Meta-DP: Domain prediction meta server. Bioinformatics, 21:2917–2920.

  • von Ohsen, N., Sommer, I., Zimmer, R., and Lengauer, T. 2004. Arby: Automatic protein structure prediction using profile-profile alignment and confidence measures. Bioinformatics, 20:2228–2235.

  • Wheelan, S.J., Marchler-Bauer, A., and Bryant, S.H. 2000. Domain size distributions can predict domain boundaries. Bioinformatics, 16(7):613–618.

  • Zdobnov, E.M. and Apweiler, R. 2001. InterProScan: An integration platform for the signature-recognition methods in InterPro. Bioinformatics, 17:847–848.

Acknowledgments

This work is supported by the Institute for Genomics and Bioinformatics at UCI, a Laurel Wilkening Faculty Innovation award, an NIH Biomedical Informatics Training grant (LM-07443-01), an NSF MRI grant (EIA-0321390), a Sun Microsystems award, and a grant from the University of California Systemwide Biotechnology Research and Education Program (UC BREP) to PB.

Author information

Corresponding author

Correspondence to Jianlin Cheng.

About this article

Cite this article

Cheng, J., Sweredoski, M.J. & Baldi, P. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Min Knowl Disc 13, 1–10 (2006). https://doi.org/10.1007/s10618-005-0023-5

  • DOI: https://doi.org/10.1007/s10618-005-0023-5
