A Feature Selection Algorithm Based on Graph Theory and Random Forests for Protein Secondary Structure Prediction

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2007)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4463)

Abstract

Protein secondary structure prediction is one of the most widely studied problems in bioinformatics. Predicting the secondary structure of a protein is an important step toward determining its tertiary structure and thus its function. This paper explores the protein secondary structure problem using a novel feature selection algorithm combined with a machine learning approach based on random forests. For feature reduction, we propose a graph-theoretical algorithm that finds cliques in the non-position-specific evolutionary profiles of proteins obtained from BLOSUM62. The features selected by this algorithm are then used to condense the position-specific evolutionary information obtained from PSI-BLAST. Our results show that we save a significant amount of space and time and still achieve high accuracy, even when the number of features is reduced by 25%.
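
The abstract does not say how the graph is built, which cliques are kept, or how profile columns are combined, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes a threshold graph over residues (an edge when a similarity score reaches a cutoff), enumerates maximal cliques with the basic Bron-Kerbosch recursion, greedily turns them into disjoint residue groups, and averages the profile columns within each group. The six-letter alphabet, the scores, the cutoff, and all function names are hypothetical placeholders; the scores are not real BLOSUM62 entries.

```python
from itertools import combinations

# Toy symmetric similarity scores for a 6-letter alphabet.
# Placeholder values for illustration only, NOT real BLOSUM62 entries.
ALPHABET = ["I", "L", "V", "D", "E", "K"]
TOY_SCORES = {
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("D", "E"): 2, ("E", "K"): 1, ("D", "K"): -1,
}

def build_graph(alphabet, scores, threshold):
    """Connect two residues when their similarity score reaches the threshold."""
    adj = {a: set() for a in alphabet}
    for a, b in combinations(alphabet, 2):
        # Pairs missing from the toy table get an arbitrary low default score.
        s = scores.get((a, b), scores.get((b, a), -4))
        if s >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found

def select_groups(cliques):
    """Assumed heuristic: take larger cliques first, one group per residue."""
    groups, used = [], set()
    for c in sorted(cliques, key=lambda c: (-len(c), sorted(c))):
        rest = c - used
        if rest:
            groups.append(sorted(rest))
            used |= rest
    return groups

def condense_row(row, groups):
    """Average the profile columns that fall into the same group."""
    return [sum(row[a] for a in g) / len(g) for g in groups]

# Usage: reduce one hypothetical profile row from 6 features to 3.
adj = build_graph(ALPHABET, TOY_SCORES, threshold=1)
groups = select_groups(maximal_cliques(adj))
row = {"I": 4, "L": 2, "V": 3, "D": -3, "E": -1, "K": 0}
print(groups)                     # [['I', 'L', 'V'], ['D', 'E'], ['K']]
print(condense_row(row, groups))  # [3.0, -2.0, 0.0]
```

In this toy setting each position contributes three averaged scores instead of six, and a classifier such as a random forest would then be trained on the shorter vectors. Averaging is only one possible way to merge columns; the paper may combine or select features differently.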

Editor information

Ion Măndoiu, Alexander Zelikovsky

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Altun, G., Hu, HJ., Gremalschi, S., Harrison, R.W., Pan, Y. (2007). A Feature Selection Algorithm Based on Graph Theory and Random Forests for Protein Secondary Structure Prediction. In: Măndoiu, I., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2007. Lecture Notes in Computer Science, vol 4463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72031-7_54

  • DOI: https://doi.org/10.1007/978-3-540-72031-7_54

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72030-0

  • Online ISBN: 978-3-540-72031-7

  • eBook Packages: Computer Science, Computer Science (R0)