Abstract
The tertiary structure of a protein is composed of α-helices and β-sheets, referred to as Secondary Structure Elements (SSE). SSE are evolutionarily conserved and define the overall fold of a protein, so they can be used to classify protein families. SSE form pairwise energetic interactions that can be described by graphs. Neighbourhood graphs apply edge conditions to filter the relevant interactions out of the set of all pairwise relations. Graphlet analysis then stochastically samples subgraphs to identify overrepresented interaction patterns. Distinguishing graphlets during sampling requires an efficient graph isomorphism algorithm.
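To make the sampling step concrete, the following is a minimal sketch of drawing one connected subgraph (a graphlet candidate) from an interaction graph. It is an illustration only, not the sampling scheme used in the chapter: the adjacency-dictionary representation and the names `sample_graphlet` and `induced_edges` are assumptions, and growing a vertex set from a random start vertex does not sample graphlets uniformly.

```python
import random


def sample_graphlet(adj, k, rng):
    """Grow a connected vertex set of size k from a random start vertex.

    adj maps each vertex to the set of its neighbours (hypothetical layout).
    Returns a frozenset of k vertices, or None if the start vertex lies in a
    connected component with fewer than k vertices.
    """
    start = rng.choice(sorted(adj))
    chosen = {start}
    frontier = set(adj[start])
    while len(chosen) < k:
        if not frontier:
            return None  # component is too small to hold a k-vertex graphlet
        v = rng.choice(sorted(frontier))
        chosen.add(v)
        frontier |= adj[v]   # neighbours of the new vertex become candidates
        frontier -= chosen   # never pick a vertex twice
    return frozenset(chosen)


def induced_edges(adj, vertices):
    """Edges of the subgraph induced by the sampled vertices."""
    return {frozenset((u, v)) for u in vertices for v in adj[u] if v in vertices}


# Toy example: a path of five SSE-like vertices 0-1-2-3-4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
graphlet = sample_graphlet(path, 3, random.Random(0))
edges = induced_edges(path, graphlet)
```

Counting how often each sampled pattern occurs then requires deciding whether two samples are the same graphlet, which is where graph isomorphism comes in.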
In this chapter, we describe a graph isomorphism algorithm that is easy to implement. In a preprocessing phase, the algorithm combines the marks assigned to vertices into more informative marks. Propagated along edges, the marks collect information about the structure of the graph and thus allow isomorphisms to be found efficiently in a subsequent backtracking step.
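The two phases described above can be sketched as follows. This is a generic colour-refinement-plus-backtracking illustration under stated assumptions, not the chapter's exact procedure: vertex degree is assumed as the initial mark, both graphs are refined jointly so that their marks are comparable, and the names `refine_marks` and `find_isomorphism` are hypothetical.

```python
from collections import Counter


def refine_marks(graphs, rounds):
    """Jointly refine vertex marks of several graphs (adjacency dicts)."""
    # Assumed seed mark: the vertex degree.
    marks = [{v: len(adj[v]) for v in adj} for adj in graphs]
    for _ in range(rounds):
        # New signature = own mark plus the sorted marks of all neighbours.
        sigs = [
            {v: (m[v], tuple(sorted(m[u] for u in adj[v]))) for v in adj}
            for adj, m in zip(graphs, marks)
        ]
        # Shared palette: equal signatures get equal integers in every graph.
        distinct = sorted({s for sig in sigs for s in sig.values()})
        palette = {s: i for i, s in enumerate(distinct)}
        marks = [{v: palette[s] for v, s in sig.items()} for sig in sigs]
    return marks


def find_isomorphism(adj_a, adj_b):
    """Return a vertex mapping from graph A to graph B, or None."""
    if len(adj_a) != len(adj_b):
        return None
    marks_a, marks_b = refine_marks([adj_a, adj_b], rounds=len(adj_a))
    # Different mark histograms rule out an isomorphism without any search.
    if Counter(marks_a.values()) != Counter(marks_b.values()):
        return None
    order = sorted(adj_a, key=lambda v: marks_a[v])
    mapping, used = {}, set()

    def extend(i):
        if i == len(order):
            return True
        v = order[i]
        for w in adj_b:
            if w in used or marks_b[w] != marks_a[v]:
                continue  # marks prune candidates before backtracking
            # Adjacency to every already mapped vertex must agree.
            if all((u in adj_a[v]) == (mapping[u] in adj_b[w]) for u in mapping):
                mapping[v] = w
                used.add(w)
                if extend(i + 1):
                    return True
                del mapping[v]
                used.discard(w)
        return False

    return dict(mapping) if extend(0) else None


# Toy example: two labellings of a four-vertex path, and a star.
path_a = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
path_b = {"a": {"c"}, "c": {"a", "d"}, "d": {"c", "b"}, "b": {"d"}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
iso = find_isomorphism(path_a, path_b)    # a valid vertex mapping
no_iso = find_isomorphism(path_a, star)   # None
```

The refinement only prunes the search. The backtracking step is what guarantees that any returned mapping preserves every edge and non-edge.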
Applying graphlet analysis to neighbourhood graphs of structures from the ICGEB Protein Benchmark database and the Super-Secondary Structure database (SSSDB), we identify 627 significant graphlets. Decision trees subsequently trained on these features predict the four SCOP levels and the SSSDB classes with a mean Area Under Curve (mAUC) better than 0.89 (5-fold cross-validation). Regularizing these decision trees to avoid overfitting reveals that about 20 graphlets suffice for reliable prediction of structural features. In particular, we find that graphlets composed of five secondary structure elements are most informative for classification. Conversely, decision trees trained on the mere sequence of SSE obtained from the protein sequence can predict graphlets directly from secondary structure annotation. The best prediction performance reaches up to a Matthews Correlation Coefficient (MCC) of 0.7.
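For readers unfamiliar with the MCC, the sketch below computes it from the four cells of a binary confusion matrix using the standard definition. The function name `matthews_cc` and the counts in the example are illustrative assumptions, not values from the chapter, and returning 0.0 for an empty row or column is a common convention rather than part of the definition.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient of a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Made-up counts for a single "graphlet present / absent" prediction task.
score = matthews_cc(tp=6, tn=3, fp=1, fn=2)
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction correct), so it stays informative when a graphlet occurs in only a small fraction of the structures.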
From the experiments in this chapter, we conclude that SSE interactions form patterns that differ significantly from random. These patterns are useful for predicting structural protein features, and they can themselves be predicted from protein sequence. They can therefore serve as constraints to facilitate the de novo prediction of unknown protein structures.
© 2012 Springer-Verlag GmbH Berlin Heidelberg
Henneges, C., Behle, C., Zell, A. (2012). Practical Graph Isomorphism for Graphlet Data Mining in Protein Structures. In: Madani, K., Dourado Correia, A., Rosa, A., Filipe, J. (eds) Computational Intelligence. IJCCI 2010. Studies in Computational Intelligence, vol 399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27534-0_23
DOI: https://doi.org/10.1007/978-3-642-27534-0_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27533-3
Online ISBN: 978-3-642-27534-0