Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures

  • Conference paper
Computational Intelligence (IJCCI 2010)

Part of the book series: Studies in Computational Intelligence (SCI, volume 399)

Abstract

The tertiary structure of a protein is composed of α-helices and β-sheets, which are referred to as Secondary Structure Elements (SSE). SSE are evolutionarily conserved and define the overall fold of a protein, so they can be used to classify protein families. SSE form pairwise energetic interactions that can be described by graphs. Neighbourhood graphs employ edge conditions to filter the relevant interactions out of the set of pairwise relations. Graphlet analysis then stochastically samples subgraphs to identify overrepresented interaction patterns. Distinguishing graphlets during sampling requires an efficient graph isomorphism algorithm.
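To make the sampling step concrete, the following is a minimal sketch of how a connected subgraph of k vertices could be drawn from a neighbourhood graph. It is an illustration under stated assumptions, not the sampling scheme used in the chapter: the names `sample_connected_subgraph` and `induced_edges` are hypothetical, the graph is assumed to be undirected and stored as a dictionary of neighbour sets, and the frontier expansion shown here does not sample subgraphs uniformly.

```python
import random


def sample_connected_subgraph(adj, k, rng):
    """Grow a connected vertex set of size k by random frontier expansion.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Returns a frozenset of k vertices, or None if the component containing
    the randomly chosen start vertex has fewer than k vertices.
    Note: this simple scheme is NOT uniform over connected subgraphs.
    """
    start = rng.choice(sorted(adj))
    chosen = {start}
    frontier = set(adj[start])
    while len(chosen) < k:
        if not frontier:
            return None  # component exhausted before reaching k vertices
        v = rng.choice(sorted(frontier))
        chosen.add(v)
        frontier |= adj[v]
        frontier -= chosen
    return frozenset(chosen)


def induced_edges(adj, vertices):
    """Return the edges of the subgraph induced by the given vertices."""
    return {frozenset((u, v)) for u in vertices for v in adj[u] if v in vertices}


if __name__ == "__main__":
    # Toy path graph standing in for a neighbourhood graph of five SSE.
    toy = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    sample = sample_connected_subgraph(toy, 3, random.Random(0))
    print(sorted(sample), induced_edges(toy, sample))
```

Each sampled vertex set, together with its induced edges, is a candidate graphlet; counting how often each isomorphism class occurs is what makes the isomorphism test the bottleneck.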

In this chapter, we describe a graph isomorphism algorithm that is easy to implement. In a preprocessing phase, the algorithm combines the marks assigned to vertices into more informative marks. Propagated along edges, the marks collect information about the structure of the graph and thus allow isomorphisms to be found efficiently in a subsequent backtracking step.
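The two phases described above can be illustrated with a generic sketch. This is not the chapter's algorithm: it uses a standard neighbourhood refinement of vertex marks followed by plain backtracking, and it assumes small undirected graphs given as dictionaries of neighbour sets with hashable, mutually comparable initial marks. The names `refine_marks`, `find_isomorphism`, and the parameter `rounds` are hypothetical.

```python
def refine_marks(adj, marks, palette, rounds):
    """Combine each vertex mark with the sorted marks of its neighbours.

    palette is a dict shared between both graphs, so that identical
    signatures in the same round receive the same integer mark.
    """
    marks = dict(marks)
    for r in range(rounds):
        new_marks = {}
        for v in adj:
            signature = (marks[v], tuple(sorted(marks[u] for u in adj[v])))
            new_marks[v] = palette.setdefault((r, signature), len(palette))
        marks = new_marks
    return marks


def find_isomorphism(adj_a, marks_a, adj_b, marks_b, rounds=2):
    """Return a mark-preserving isomorphism from graph a to graph b, or None."""
    if len(adj_a) != len(adj_b):
        return None
    palette = {}
    ma = refine_marks(adj_a, marks_a, palette, rounds)
    mb = refine_marks(adj_b, marks_b, palette, rounds)
    # Cheap rejection: the multisets of refined marks must coincide.
    if sorted(ma.values()) != sorted(mb.values()):
        return None

    order = list(adj_a)
    mapping, used = {}, set()

    def extend(i):
        if i == len(order):
            return True
        v = order[i]
        for w in adj_b:
            if w in used or mb[w] != ma[v]:
                continue  # refined marks prune most candidates
            # Adjacency to every already mapped vertex must be preserved.
            if all((u in adj_a[v]) == (mapping[u] in adj_b[w]) for u in mapping):
                mapping[v] = w
                used.add(w)
                if extend(i + 1):
                    return True
                del mapping[v]
                used.discard(w)
        return False

    return dict(mapping) if extend(0) else None


if __name__ == "__main__":
    # Triangle with a pendant vertex, in two different labellings.
    g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    h = {"w": {"x"}, "x": {"w", "y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
    print(find_isomorphism(g, {v: "H" for v in g}, h, {v: "H" for v in h}))
```

Because refined marks are invariant under isomorphism, a vertex can only be mapped to a vertex carrying the same mark, which is what keeps the backtracking search small for graphlet-sized inputs.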

Applying graphlet analysis to neighbourhood graphs of structures from the ICGEB Protein Benchmark database and the Super-Secondary Structure database (SSSDB), we identify 627 significant graphlets. Decision trees subsequently trained on these features predict the four SCOP levels and the SSSDB classes with a mean Area Under Curve (mAUC) better than 0.89 (5-fold cross-validation). Regularizing these decision trees to avoid overfitting reveals that about 20 graphlets are sufficient for reliable prediction of structural features. In particular, we find that graphlets composed of five secondary structure elements are the most informative for classification. Conversely, decision trees trained only on the sequence of SSE obtained from the protein sequence can predict graphlets directly from secondary structure annotation, reaching a Matthews Correlation Coefficient (MCC) of up to 0.7.
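For readers unfamiliar with the MCC reported above, the following sketch computes it from the four counts of a binary confusion matrix using the standard definition. The function name `matthews_corrcoef_from_counts` is hypothetical, and returning 0.0 when a marginal count is zero is a common convention assumed here, not something stated in the chapter.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 if the denominator is zero (assumed convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(matthews_corrcoef_from_counts(tp=40, tn=45, fp=5, fn=10))
```

The MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and unlike plain accuracy it remains informative when the classes are imbalanced.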

From the experiments in this chapter, we conclude that SSE interactions form patterns that differ significantly from random. These patterns are useful for predicting structural protein features and can themselves be predicted from the protein sequence. They can therefore be used as constraints to facilitate the de novo prediction of unknown protein structures.

Author information

Corresponding author

Correspondence to Carsten Henneges.

Copyright information

© 2012 Springer-Verlag GmbH Berlin Heidelberg

About this paper

Cite this paper

Henneges, C., Behle, C., Zell, A. (2012). Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures. In: Madani, K., Dourado Correia, A., Rosa, A., Filipe, J. (eds) Computational Intelligence. IJCCI 2010. Studies in Computational Intelligence, vol 399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27534-0_23

  • DOI: https://doi.org/10.1007/978-3-642-27534-0_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27533-3

  • Online ISBN: 978-3-642-27534-0

  • eBook Packages: Engineering, Engineering (R0)
