Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures

Henneges, Carsten; Behle, Christoph; Zell, Andreas

doi:10.1007/978-3-642-27534-0_23

Part of the book series: Studies in Computational Intelligence (SCI, volume 399)

Included in the following conference series:

International Joint Conference on Computational Intelligence

Abstract

The tertiary structure of proteins is composed of α-helices and β-sheets, which are referred to as Secondary Structure Elements (SSE). SSE are evolutionarily conserved and define the overall fold of a protein, so they can be used to classify protein families. SSE form pairwise energetic interactions that can be described by graphs. Neighbourhood graphs employ edge conditions to filter the relevant interactions out of a set of pairwise relations. Graphlet analysis then uses stochastic sampling of subgraphs to identify overrepresented interaction patterns. Distinguishing graphlets during sampling requires an efficient algorithm for graph isomorphism.
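As a rough illustration of these two steps, the sketch below builds a neighbourhood graph and samples a connected subgraph from it. The abstract does not say which edge condition is used, so the relative neighbourhood graph condition (keep an edge unless a third point is closer to both endpoints than they are to each other) is an assumption, as are the function names and the use of plain coordinates in place of SSE interaction data.

```python
import math
import random


def relative_neighbourhood_graph(points):
    """Keep edge (i, j) unless some third point k is closer to both i and j
    than i and j are to each other (assumed edge condition)."""
    n = len(points)

    def dist(a, b):
        return math.dist(points[a], points[b])

    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = dist(i, j)
            blocked = any(
                max(dist(i, k), dist(j, k)) < d_ij
                for k in range(n)
                if k != i and k != j
            )
            if not blocked:
                edges.add((i, j))
    return edges


def sample_connected_subgraph(edges, size, rng):
    """Grow a connected vertex set by repeatedly adding a random neighbour.
    Returns (vertices, induced_edges), or None if no such subgraph is reached."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if not adjacency:
        return None
    chosen = {rng.choice(sorted(adjacency))}
    while len(chosen) < size:
        frontier = sorted({w for u in chosen for w in adjacency[u]} - chosen)
        if not frontier:
            return None  # the connected component is smaller than `size`
        chosen.add(rng.choice(frontier))
    induced = {(u, v) for (u, v) in edges if u in chosen and v in chosen}
    return chosen, induced
```

Counting how often each sampled subgraph type occurs, and comparing those counts against a random baseline, is what would then flag a pattern as overrepresented; that statistical step is not shown here.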

In this chapter, we describe a graph isomorphism algorithm that is easy to implement. In a preprocessing phase, the algorithm combines the marks assigned to vertices into more informative marks. Propagated over edges, the marks collect information about the structure of the graph and hence allow isomorphisms to be found efficiently in a subsequent backtracking step.
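The general idea of mark propagation followed by backtracking can be sketched as below. This is not the chapter's algorithm, whose details are not given in the abstract; it is a generic colour-refinement scheme under stated assumptions: graphs are dictionaries mapping each vertex to its set of neighbours, initial marks are integers (for example 0 for helix and 1 for sheet), and all function names are hypothetical.

```python
def refine_jointly(adj_a, adj_b, marks_a, marks_b):
    """Replace each mark by (own mark, sorted neighbour marks), renumbering
    with a palette shared by both graphs so marks stay comparable."""
    for _ in range(max(len(adj_a), len(adj_b))):
        sig_a = {v: (marks_a[v], tuple(sorted(marks_a[w] for w in adj_a[v])))
                 for v in adj_a}
        sig_b = {v: (marks_b[v], tuple(sorted(marks_b[w] for w in adj_b[v])))
                 for v in adj_b}
        ordered = sorted(set(sig_a.values()) | set(sig_b.values()))
        palette = {sig: i for i, sig in enumerate(ordered)}
        marks_a = {v: palette[sig_a[v]] for v in adj_a}
        marks_b = {v: palette[sig_b[v]] for v in adj_b}
    return marks_a, marks_b


def find_isomorphism(adj_a, adj_b, marks_a, marks_b):
    """Return a mark-preserving vertex mapping from graph A to graph B, or None."""
    if len(adj_a) != len(adj_b):
        return None
    marks_a, marks_b = refine_jointly(adj_a, adj_b, marks_a, marks_b)
    if sorted(marks_a.values()) != sorted(marks_b.values()):
        return None  # refined marks already rule out an isomorphism
    order = list(adj_a)
    mapping, used = {}, set()

    def extend(index):
        if index == len(order):
            return True
        v = order[index]
        for candidate in adj_b:
            if candidate in used or marks_b[candidate] != marks_a[v]:
                continue
            # Edges to already-mapped vertices must agree in both graphs.
            if all((u in adj_a[v]) == (mapping[u] in adj_b[candidate])
                   for u in mapping):
                mapping[v] = candidate
                used.add(candidate)
                if extend(index + 1):
                    return True
                del mapping[v]
                used.discard(candidate)
        return False

    return dict(mapping) if extend(0) else None
```

Because the refined marks restrict which vertices may be paired, the backtracking step usually has very few candidates to try on small graphs such as graphlets.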

Applying graphlet analysis to neighbourhood graphs for structures from the ICGEB Protein Benchmark database and the Super-Secondary Structure database (SSSDB), we identify 627 significant graphlets. Decision trees subsequently trained on these features predict the four SCOP levels and SSSDB classes with a mean Area Under Curve (mAUC) better than 0.89 (5-fold CV). Regularizing these decision trees to avoid overfitting reveals that about 20 graphlets are sufficient for reliable prediction of structural features. In particular, we find that graphlets composed of five secondary structure elements are most informative for classification. Conversely, using decision trees trained on the mere sequence of SSE obtained from the protein sequence, we are also able to predict graphlets directly from secondary structure annotation. Optimal prediction performance thereby reaches a Matthews Correlation Coefficient (MCC) of up to 0.7.
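For readers unfamiliar with the MCC, a minimal sketch of the standard binary definition follows. The function name and the convention of returning 0.0 when the denominator vanishes are assumptions for illustration; the abstract does not describe how the authors computed the score.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts.
    Ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return numerator / denominator
```

Unlike plain accuracy, the score uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance.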

From our experiments in this chapter, we conclude that SSE interactions form patterns significantly different from random. These patterns are useful for predicting structural protein features, and they can themselves be predicted from protein sequence. Therefore they can be used as constraints to facilitate the de novo prediction of unknown protein structures.

Author information

Authors and Affiliations

Center for Bioinformatics Tübingen, Eberhard Karls Universität Tübingen, Sand 1, Tübingen, Germany
Carsten Henneges & Andreas Zell
Lehrstuhl für Theoretische Informatik, Eberhard Karls Universität Tübingen, Sand 13, Tübingen, Germany
Christoph Behle

Corresponding author

Correspondence to Carsten Henneges.

Editor information

Editors and Affiliations

Images, Signals and Intelligence, University Paris-Est Creteil (UPEC), LISSI EA 3956, Paris, 77127, France
Kurosh Madani
Departamento de Engenharia Informatica, University of Coimbra, Polo II - Pinhal de Marrocos, Coimbra, 3030, Portugal
António Dourado Correia
Systems and Robotics Institute, Evolutionary Systems and Biomedical, Instituto Superior Tecnico IST, Av. Rovisco Pais, Lisboa, 1049-001, Portugal
Agostinho Rosa
INSTICC, Polytechnic Institute of Setúbal, Setubal, 2910-595, Portugal
Joaquim Filipe

About this paper

Cite this paper

Henneges, C., Behle, C., Zell, A. (2012). Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures. In: Madani, K., Dourado Correia, A., Rosa, A., Filipe, J. (eds) Computational Intelligence. IJCCI 2010. Studies in Computational Intelligence, vol 399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27534-0_23

DOI: https://doi.org/10.1007/978-3-642-27534-0_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27533-3
Online ISBN: 978-3-642-27534-0
eBook Packages: Engineering, Engineering (R0)