IMPRECO: Distributed prediction of protein complexes
Introduction
Proteins are elementary building blocks of the biological processes occurring within cells. They play their role through mutual interactions, forming a very broad network known as the interactome [1]. Biological research has therefore focused on determining the complete set of Protein-to-Protein Interactions (PPI) that occur in various organisms. As a result, different experimental assays have accumulated a large quantity of data about protein interactions in cells [2], [3]. Interactions among proteins fall into many different types according to their biochemical nature. The most common interaction involves the direct contact of molecules, but proteins may also interact through a medium or even through the exchange of ions. The set of all binary interactions is spread across different repositories, such as BIND [4], DIP [5], and MIPS [6]. These databases usually contain interaction information determined in wet labs via one or more experimental technologies.
The high number of protein interactions taking place in a cell makes manual analysis, e.g. the identification of global or local properties, unfeasible even for a simple organism such as yeast. Hence the need for computer-based tools for PPI data modeling, management, and analysis.
The basic protein interaction (binary interaction) involves only two proteins and can be modeled as the pair of involved proteins together with information describing the kind and, if necessary, the direction of the interaction. Note that binary interactions are usually not named, i.e. there is no naming standard for interactions yet. A PPI network, being the set of all the protein interactions in an organism, is commonly represented as a graph (directed, where applicable) [7], [8], in which nodes represent proteins and edges represent the interactions among them.
Nonetheless, such a model does not capture the differences among interactions. Edges, in fact, are usually not labelled, so the kind of interaction is usually unknown in the analysis phase.
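As a minimal sketch of this graph model, the Python fragment below builds an undirected PPI network from a list of binary interactions. The function name `build_ppi_graph`, the adjacency-map representation, and the protein identifiers `P1` to `P4` are illustrative assumptions, not part of any repository format or of the system described in this paper.

```python
from collections import defaultdict


def build_ppi_graph(interactions):
    """Build an undirected PPI graph as an adjacency map.

    `interactions` is an iterable of (protein_a, protein_b) pairs.
    Self-interactions are skipped and repeated pairs collapse into one edge.
    """
    graph = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-loops in this simplified model
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Hypothetical binary interactions; identifiers are placeholders.
toy_interactions = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P2", "P3"),
    ("P3", "P4"),
    ("P2", "P1"),  # same interaction reported in the opposite order
]

ppi = build_ppi_graph(toy_interactions)
edge_count = sum(len(neighbours) for neighbours in ppi.values()) // 2
```

Storing neighbours in sets makes the duplicate report of the P1/P2 interaction harmless. A directed or labelled variant would need a richer edge representation than this sketch provides.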
The modeling of PPI networks as graphs has enabled the investigation of biological properties of an organism through the use of graph-based algorithms [9] that aim to discover biologically meaningful facts by exploring structural properties of the underlying graph.
Initial attempts tried to discover the global properties of such networks and to identify theoretical models that explain them. In addition to the analysis of global properties, the study of recurring local topological features, such as overrepresented subgraphs, has attracted increasing interest. Finally, a recent trend in protein interaction analysis aims at the comparative investigation of PPI networks, discovering conserved subgraphs among them.
For instance, small dense regions in a PPI network, i.e. regions with a number of interactions higher than the average of the network, may represent a protein complex, that is, a group of two or more associated proteins that interact to achieve a common biological goal. Proteins bound in a complex act as a single functional unit via non-covalent interactions. Each complex has a different lifetime, i.e., the time over which it remains stable. Moreover, the formation of a protein complex acts as an activator or an inhibitor of one or more of its members. Complexes are a fundamental building block of many biological processes, so the analysis of their conservation during evolution and of the possible correlation between complexes and various diseases are important research areas [10]. For example, Breast Cancer Protein 1 (BRCA1) is known to participate in multiple cellular processes involving multiple protein complexes that play an important role in the mechanisms for DNA repair [11]. Fig. 1 depicts a fragment of the human PPI network highlighting a complex in which BRCA1 participates. The complex comprises the proteins BRCA1, RAD50, Mre11 and NBS.
Protein complexes can be determined in wet labs using various techniques such as Mass-Spectrometry (MS) [3] or yeast-two-hybrid (Y2H) [12], but these experiments are usually time-consuming.
The main idea underlying the application of MS is the use of a protein as a bait to capture all the possible interacting partners. This protein is initially inserted into a sample, then it is purified from other proteins through a series of subsequent cleavages that aim to separate the investigated protein from the proteins that do not interact with it. Finally, all the proteins that are bound to the bait are analyzed through the mass spectrometer. Data generated by the spectrometer are mined and the interacting proteins are identified. The yeast-two-hybrid technique aims to verify the existence of an interaction between two selected proteins. This assay uses a protein as a bait to identify the interaction with another protein called the prey. In summary, MS is able to directly identify protein complexes, while Y2H is able only to check the existence of binary interactions, so it requires many experiments using the same bait to find a complex.
A protein complex prediction algorithm (complex predictor) tries to find highly connected regions in a PPI network that may reveal a protein complex. Protein complexes can be extracted from PPI networks by searching for small dense regions, i.e., regions with a higher ratio of edges to nodes than the PPI network has on average. After the early work of Bader and Hogue [13], a number of algorithms for the prediction of protein complexes [14], [15], [16], [17] have been introduced.
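The notion of a "dense region" can be illustrated with a short Python sketch. It uses one common definition of density, 2|E| / (|V|(|V| - 1)) for the subgraph induced by a node set, which is an assumption made here for illustration; the individual predictors cited above may use different measures. The toy network and its node names are hypothetical.

```python
def density(graph, nodes):
    """Density of the subgraph induced by `nodes` in an undirected graph.

    `graph` maps each node to the set of its neighbours.
    Returns a value between 0.0 (no edges) and 1.0 (a clique).
    """
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Every undirected edge inside the node set is counted from both ends.
    endpoints = sum(len(graph.get(u, set()) & nodes) for u in nodes)
    edges = endpoints // 2
    return 2 * edges / (n * (n - 1))


# Toy network: a 4-node clique (A, B, C, D) attached to a path D-E-F-G.
toy_edges = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("D", "E"), ("E", "F"), ("F", "G"),
]
toy_graph = {}
for u, v in toy_edges:
    toy_graph.setdefault(u, set()).add(v)
    toy_graph.setdefault(v, set()).add(u)

clique_density = density(toy_graph, ["A", "B", "C", "D"])
network_density = density(toy_graph, toy_graph)  # all nodes
```

In this toy network the clique has density 1.0 while the whole network has density 18/42, so the clique stands out as a candidate region in the sense described above.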
A protein complex predictor can be evaluated by considering the percentage of discovered subnetworks that correspond to real complexes, as opposed to meaningless ones. Currently, there is neither a commonly accepted benchmark nor a gold standard. To estimate the quality of prediction, databases of experimentally verified complexes can be used as a benchmark. Only a few such databases exist, including the MIPS catalog of protein complexes in yeast [18] and the CORUM Complexes Database [19]. These databases store complexes that have been determined or verified by experimental assays. The performance of a prediction algorithm is therefore influenced by: (i) the kind and the initial configuration of the algorithm used, and (ii) the validity of the protein-protein interactions (i.e., the edges) of the input interaction network (i.e., the graph) [20].
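The evaluation idea can be sketched in Python as follows. The matching criterion is not specified in the text above, so this sketch assumes that a predicted complex "corresponds" to a real one when their Jaccard overlap reaches a threshold of 0.5. Both the criterion and the threshold, as well as the function names, are illustrative assumptions. The reference set reuses the BRCA1 complex members named earlier purely as toy data.

```python
def jaccard(a, b):
    """Jaccard overlap between two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matched_fraction(predicted, reference, threshold=0.5):
    """Fraction of predicted complexes matching at least one reference complex.

    A match is declared when the Jaccard overlap is >= `threshold`
    (an assumed criterion, not one taken from a specific benchmark).
    """
    if not predicted:
        return 0.0
    hits = sum(
        1
        for p in predicted
        if any(jaccard(p, r) >= threshold for r in reference)
    )
    return hits / len(predicted)


# Toy data: one reference complex and two hypothetical predictions.
reference = [{"BRCA1", "RAD50", "Mre11", "NBS"}]
predicted = [
    {"BRCA1", "RAD50", "Mre11"},  # overlaps the reference (Jaccard 0.75)
    {"X1", "X2"},                 # placeholder proteins, no overlap
]

score = matched_fraction(predicted, reference)
```

A stricter threshold rewards near-exact matches, while a looser one tolerates partially recovered complexes; the choice directly changes the reported percentage.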
We have developed a tool (IMPRECO, for IMproving PREdiction of COmplexes) that combines the results of different predictors using an integration algorithm, which gathers (partial) results from the predictors and possibly produces novel predictions.
In this paper, we present a distributed architecture that implements the IMPRECO prediction algorithm and demonstrates its ability to predict protein complexes. The proposed meta-predictor first invokes, in parallel, different available predictors wrapped as services, then integrates their results using graph analysis, and finally evaluates the predicted results by comparing them against external databases storing experimentally determined protein complexes.
The remainder of the paper is organized as follows. Section 2 introduces protein complex prediction. Section 3 discusses related work. Section 4 presents the IMPRECO algorithm. Section 5 discusses the distributed architecture of IMPRECO. Section 6 presents a case study and evaluates the performance of the resulting predictions with respect to those of the basic predictors. Finally, Section 7 concludes the paper and outlines future work.
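The invoke-then-integrate flow of a meta-predictor can be sketched in Python under strong simplifying assumptions. The two predictors below are hypothetical local functions standing in for remote services, and the `integrate` function is a naive stand-in (greedy merging of overlapping clusters) that is not the IMPRECO integration algorithm, whose details are given in Section 4.

```python
from concurrent.futures import ThreadPoolExecutor


def predictor_a(graph):
    """Hypothetical predictor returning fixed toy clusters."""
    return [frozenset({"P1", "P2", "P3"})]


def predictor_b(graph):
    """Second hypothetical predictor with partly overlapping output."""
    return [frozenset({"P2", "P3", "P4"}), frozenset({"P7", "P8"})]


def run_predictors(graph, predictors):
    """Invoke every predictor concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(predictors)) as pool:
        futures = {name: pool.submit(fn, graph) for name, fn in predictors.items()}
        return {name: future.result() for name, future in futures.items()}


def integrate(results, min_overlap=0.5):
    """Naive stand-in integration: merge clusters with high Jaccard overlap."""
    merged = []
    for name in sorted(results):  # fixed order keeps the outcome deterministic
        for cluster in results[name]:
            for i, existing in enumerate(merged):
                overlap = len(cluster & existing) / len(cluster | existing)
                if overlap >= min_overlap:
                    merged[i] = existing | cluster
                    break
            else:
                merged.append(frozenset(cluster))
    return merged


predictors = {"a": predictor_a, "b": predictor_b}
results = run_predictors({}, predictors)  # empty graph: toy predictors ignore it
integrated = integrate(results)
```

Here the overlapping clusters from the two predictors are merged into a single four-protein candidate, while the non-overlapping one is kept unchanged. A thread pool suits this sketch because real predictor calls would mostly wait on remote services rather than compute locally.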
Protein complex prediction
The prediction of protein complexes from experimental data is still a challenge. This paper focuses on algorithmic approaches that rely on two main ideas: (i) modeling the whole set of interactions as a graph, and (ii) the use of clustering for finding complexes. The workflow for complex prediction comprises three main steps: (i) building a PPI network from binary interaction data, (ii) algorithmic analysis of the network, and (iii) result evaluation, as depicted in Fig. 2. After an algorithm
Related work
This section presents the main algorithms for protein complex prediction. They usually take as input a graph representing a PPI network, then extract complexes using graph-clustering methods.
The Molecular Complex Detection Algorithm (MCODE) [13] applies a strategy based on three main steps. First, it weights all the nodes based on their edge density, calculated for a local area called the k-core. The k-core of a graph is its (central) most densely
The IMPRECO algorithm
The IMPRECO algorithm is a meta-predictor that predicts complexes using four steps as depicted in Fig. 2: (i) generation of the PPI network, (ii) parallel execution of different predictors, (iii) integration of the obtained predictions, and (iv) result evaluation. Finally, the results are visualized and presented to the user. In the first step, IMPRECO collects data of binary interactions, merges them, and builds a corresponding graph. In the second step, it invokes the existing predictors in
The IMPRECO system
The IMPRECO system is based on a distributed Service-Oriented Architecture, as depicted in Fig. 3. The main modules of the system are: (i) an integration master service that implements the integration algorithms, (ii) a prediction master service that uses an internally parallel architecture to wrap existing predictors as web/grid services, (iii) an evaluation module that evaluates results, and (iv) a graphical user interface (GUI), also depicted in Fig. 5. The IMPRECO system is designed on top of
Case study
This section demonstrates the ability of IMPRECO to predict complexes via a case study involving publicly available data. There is currently no gold standard dataset for benchmarking protein complex prediction algorithms, so the various approaches usually use a yeast dataset as input data, and the MIPS database to compare predicted complexes with real ones. Even though MIPS is nowadays one of the most comprehensive public datasets of yeast complexes available, it is still an incomplete
Conclusions
In this work, we presented IMPRECO, a system for the prediction of protein complexes starting from protein interaction networks. IMPRECO is based on a distributed architecture built on top of the grid infrastructure. The proposed system collects data from the user, then runs different predictors in parallel. Results are finally integrated. Using IMPRECO, the user can run different complex prediction experiments in parallel. IMPRECO also provides an easy data integration
References (32)
- et al. Using ontologies for preprocessing and mining spectra data on the Grid. Future Gener. Comput. Syst. (2007)
- et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA (2001)
- et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature (2000)
- Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature (2002)
- et al. The biomolecular interaction network database and related tools 2005 update. Nucleic Acids Res. (2005)
- et al. The database of interacting proteins: 2004 update. Nucleic Acids Res. (2004)
- et al. MIPS: A database for genomes and protein sequences. Nucleic Acids Res. (2002)
- Complex networks: from graph theory to biology. Lett. Math. Phys. (2006)
- et al. The small world of metabolism. Nat. Biotechnol. (2000)
- et al. Graph-based methods for analysing networks in cell biology. Brief. Bioinform. (2006)
- A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat. Biotechnol.
- Requirement of ATM-dependent phosphorylation of BRCA1 in the DNA damage response to double-strand breaks. Science
- A novel genetic system to detect protein–protein interactions. Nature
- An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform.
- Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J. Comput. Biol.
- Protein complex prediction via cost-based clustering. Bioinformatics
Mario Cannataro has been an Associate Professor of computer engineering at the University “Magna Græcia” of Catanzaro, Italy, since 2002. He received the Laurea Degree (cum laude) in computer engineering from the University of Calabria, Italy, in 1993. His current research interests include grid computing, bioinformatics, computational proteomics and genomics, grid-based problem solving environments, and adaptive hypermedia systems. He has published two books and more than 150 papers in international journals and conference proceedings. Mario Cannataro is a co-founder of Exeura and is a member of ACM, IEEE Computer Society, HealthGrid, and BITS (Bioinformatics Italian Society).
Pietro Hiram Guzzi has been an Assistant Professor of Computer Engineering at the University “Magna Græcia” of Catanzaro, Italy, since 2008. He received his Ph.D. in Biomedical Engineering in 2008 from Magna Græcia University of Catanzaro. He received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. His research interests comprise bioinformatics, the analysis of proteomics data, and the analysis of protein interaction networks.
Pierangelo Veltri has been an Assistant Professor of Computer Engineering at the University “Magna Græcia” of Catanzaro, Italy, since 2002. He received his Ph.D. in computer science in 2002 from the University of Orsay (Paris XI), and he was a Ph.D. student from 1998 to 2002 at INRIA Rocquencourt in the Verso-Database unit. He was also a contract professor at the University of Paris XIII from 2000 to 2002, teaching database classes. His main research interests are database systems and modeling, XML and semistructured databases, geographical and spatio-temporal databases, views, and high performance computing. Since 2002 his main activities have focused on biological data management, such as mass spectra data analysis, proteomics databases, and protein structure prediction, as well as on bioinformatics topics and the design of clinical informatics systems.