IMPRECO: Distributed prediction of protein complexes

https://doi.org/10.1016/j.future.2009.08.001

Abstract

Proteins interact with one another, and the different interactions form a huge number of possible combinations that can be represented as protein-to-protein interaction (PPI) networks and mapped onto graph structures. A protein complex is a subset of mutually interacting proteins. Starting from a PPI network, protein complexes may be extracted by using computational methods. The paper proposes a new complex meta-predictor that predicts protein complexes by integrating the results of different predictors. It is based on a distributed architecture that wraps predictors as web/grid services and is built on top of a grid infrastructure. The proposed meta-predictor first invokes the available predictors, wrapped as services, in parallel, then integrates their results using graph analysis, and finally evaluates the predicted results by comparing them against external databases storing experimentally determined protein complexes.

Introduction

Proteins are elementary building blocks of the biological processes occurring within cells. They play their role via mutual interaction, composing a very broad network of interactions known as the interactome [1]. Biological research has therefore focused on determining the complete set of Protein-to-Protein Interactions (PPI) that occur in various organisms. As a result, different experimental assays have accumulated a large quantity of data about protein interactions in cells [2], [3]. Interactions among proteins can be classified into many different types according to their biochemical nature. The most common interaction involves the direct contact of molecules, but proteins may also interact through a medium or even through the exchange of ions. The set of all binary interactions is spread across different repositories, such as BIND [4], DIP [5], and MIPS [6]. These databases usually contain interaction information determined in wet labs via one or more experimental technologies.

The high number of protein interactions taking place in a cell makes manual analysis, e.g. the identification of global or local properties, unfeasible even for a simple organism such as yeast. Computer-based tools for PPI data modeling, management, and analysis are therefore needed.

The basic protein interaction (binary interaction) involves only two proteins and can be modeled by the pair of involved proteins together with information describing the kind and, if necessary, the direction of the interaction. It should be noted that binary interactions are usually not named, i.e. there is no naming standard for interactions yet. A PPI network, being the set of all the protein interactions in an organism, is commonly represented as a graph (directed where applicable) [7], [8], where nodes represent proteins and edges represent the interactions among them.

Nonetheless, such a model does not capture the differences among interactions. Edges, in fact, are usually not labelled, so the kind of interaction is usually unknown in the analysis phase.
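The graph model described above can be sketched in a few lines of Python. This is a minimal illustration, assuming an undirected, unlabelled network; the function name `build_ppi_graph` and the protein identifiers P1 to P4 are hypothetical and are not taken from any interaction database.

```python
from collections import defaultdict


def build_ppi_graph(interactions):
    """Build an undirected, unlabelled graph from binary interactions.

    `interactions` is an iterable of (protein_a, protein_b) pairs.
    Returns a dict mapping each protein to the set of its partners.
    """
    graph = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Illustrative pairs only (hypothetical identifiers).
# ("P2", "P1") repeats ("P1", "P2") and is merged into the same edge.
pairs = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4"), ("P2", "P1")]
ppi = build_ppi_graph(pairs)

# Each undirected edge is stored at both endpoints.
edge_count = sum(len(partners) for partners in ppi.values()) // 2
```

Because the edges carry no label, the kind of interaction is lost in this representation, which is the limitation noted in the text.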

The modeling of PPI networks as graphs has enabled the investigation of biological properties of an organism through the use of graph-based algorithms [9] that aim to discover biologically meaningful facts by exploring structural properties of the underlying graph.

Initial attempts tried to discover the global properties of such networks and to identify theoretical models that explain them. In addition to the analysis of global properties, the study of recurring local topological features, such as overrepresented subgraphs, has attracted increasing interest. Finally, a recent trend in protein interaction analysis aims at the comparative investigation of PPI networks, discovering subgraphs that are conserved among them.

For instance, a small dense region in a PPI network, i.e. a region with a number of interactions higher than the network average, may represent a protein complex, that is, a group of two or more associated proteins that interact to achieve a common biological goal. Proteins bound in a complex act as a single functional unit via non-covalent interactions. Each complex has a different lifetime, i.e., the time over which it remains stable. Moreover, the formation of a protein complex can act as an activator or an inhibitor of one or more of its members. Complexes are a fundamental building block of many biological processes, so the analysis of their conservation during evolution and of the possible correlation between complexes and various diseases are important research areas [10]. For example, Breast Cancer Protein 1 (BRCA1) is known to participate in multiple cellular processes involving multiple protein complexes that play an important role in the mechanisms for DNA repair [11]. Fig. 1 depicts a fragment of the human PPI network highlighting a complex in which BRCA1 participates. The complex comprises the proteins BRCA1, RAD50, Mre11 and NBS.

Protein complexes can be determined in wet labs using various techniques such as Mass-Spectrometry (MS) [3] or yeast-two-hybrid (Y2H) [12], but these experiments are usually time-consuming.

The main idea underlying the application of MS is the use of a protein as a bait to capture all of its possible interacting partners. This protein is initially inserted into a sample, then it is purified through a series of subsequent cleavages that aim to separate it from the proteins that are not interacting with it. Finally, all the proteins that are bound to the bait are analyzed with the mass spectrometer. Data generated by the spectrometer are mined and the interacting proteins are identified.

The yeast-two-hybrid technique aims to verify the existence of an interaction between two selected proteins. This assay uses a protein as a bait to identify the interaction with another protein called the prey. In summary, MS is able to identify protein complexes directly, while Y2H can only check the existence of binary interactions, so it requires many experiments using the same bait to find a complex.

A protein complex prediction algorithm (complex predictor) tries to find highly connected regions in a PPI network that may reveal a protein complex. Protein complexes can be extracted from PPI networks by searching for small dense regions, i.e., regions containing many interactions compared with the average degree of the PPI network, that is, regions with a higher ratio of edges to nodes. After the early work of Gary Bader and colleagues [13], a number of algorithms for the prediction of protein complexes [13], [14], [15], [16], [17] have been introduced.
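The notion of a dense region can be made concrete with a small sketch. It compares the edge-to-node ratio of a candidate region with that of the whole network; this is only an illustration of the criterion as worded above, not the procedure used by any specific predictor, and the function names and toy graph are hypothetical.

```python
def edge_node_ratio(graph, nodes=None):
    """Ratio of edges to nodes in the subgraph induced by `nodes`.

    `graph` maps each protein to the set of its partners (undirected).
    With `nodes=None` the ratio is computed for the whole network.
    """
    selected = set(graph) if nodes is None else set(nodes)
    if not selected:
        return 0.0
    # Every undirected edge is seen from both endpoints, hence the division by 2.
    edge_count = sum(len(graph.get(n, set()) & selected) for n in selected) / 2
    return edge_count / len(selected)


def is_dense_region(graph, nodes):
    """True if the region has a higher edge-to-node ratio than the network."""
    return edge_node_ratio(graph, nodes) > edge_node_ratio(graph)


# Toy network: a fully connected group A-B-C-D with a sparse tail D-E-F.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
clique_ratio = edge_node_ratio(toy, {"A", "B", "C", "D"})  # 6 edges / 4 nodes
tail_ratio = edge_node_ratio(toy, {"D", "E", "F"})         # 2 edges / 3 nodes
```

In this toy network the fully connected group exceeds the network-wide ratio (8 edges over 6 nodes) while the tail does not, so only the former would be flagged as a candidate complex.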

A protein complex predictor can be evaluated by taking into account the percentage of discovered subnetworks that correspond to real complexes, as opposed to meaningless ones. Currently, there is neither a commonly accepted benchmark nor a gold standard. To estimate the quality of a prediction, a set of databases of experimentally verified complexes can be used as a benchmark. Only a few such databases exist, including the MIPS catalog of protein complexes in yeast [18] and the CORUM Complexes Database [19]. These databases store experimentally verified complexes, i.e., complexes that have been determined or verified by using experimental assays. The performance of a prediction algorithm is therefore influenced by: (i) the kind and the initial configuration of the algorithm used, and (ii) the validity of the initial protein-protein interactions (i.e., the edges) of the input interaction network (i.e., the graph) [20].
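The evaluation idea described above, the share of predicted subnetworks that correspond to known complexes, can be sketched as follows. The matching rule is an assumption made for illustration: a prediction counts as a match when its Jaccard overlap with some reference complex reaches a threshold. The text does not specify a matching criterion, and the names `jaccard`, `matched_fraction` and the default threshold of 0.5 are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of protein identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matched_fraction(predicted, reference, threshold=0.5):
    """Fraction of predicted complexes matching at least one reference complex.

    ASSUMPTION: a match means Jaccard overlap >= `threshold`; this rule is
    illustrative and is not stated in the paper.
    """
    if not predicted:
        return 0.0
    hits = sum(
        1
        for p in predicted
        if any(jaccard(p, r) >= threshold for r in reference)
    )
    return hits / len(predicted)


# Hypothetical identifiers, not taken from MIPS or CORUM.
predicted = [{"P1", "P2", "P3"}, {"P4", "P5"}, {"P6", "P7", "P8"}]
reference = [{"P1", "P2", "P3", "P9"}, {"P6", "P7"}]

score = matched_fraction(predicted, reference)
strict_score = matched_fraction(predicted, reference, threshold=0.7)
```

Raising the threshold makes the evaluation stricter, which shows how strongly the reported quality of a predictor can depend on the chosen matching rule as well as on the completeness of the reference database.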

We have developed a tool (IMPRECO, for IMproving PREdiction of COmplexes) that combines the results of different predictors using an integration algorithm, which gathers (partial) results from the individual predictors and may produce novel predictions.

In this paper, we present a distributed architecture that implements the IMPRECO prediction algorithm and demonstrate its ability to predict protein complexes. The proposed meta-predictor first invokes the available predictors, wrapped as services, in parallel, then integrates their results using graph analysis, and finally evaluates the predicted results by comparing them against external databases storing experimentally determined protein complexes.
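The first two stages of this workflow, parallel invocation followed by integration, can be illustrated with a short sketch. It is written under strong assumptions: predictors are modeled as local Python callables rather than web/grid services, and the integration step is a placeholder that merely pools the predictions and removes exact duplicates. It is not the IMPRECO integration algorithm, and all names (`run_predictors`, `pool_predictions`, the toy predictors) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_predictors(predictors, graph):
    """Invoke every predictor on the same graph concurrently.

    `predictors` maps a name to a callable taking the graph and
    returning a list of sets of protein identifiers.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(predictors))) as pool:
        futures = {name: pool.submit(fn, graph) for name, fn in predictors.items()}
        # result() blocks until the corresponding predictor has finished.
        return {name: future.result() for name, future in futures.items()}


def pool_predictions(results):
    """Placeholder integration: union of all predictions, duplicates removed.

    ASSUMPTION: this stands in for the integration step; the actual
    IMPRECO integration algorithm is not reproduced here.
    """
    pooled = {frozenset(c) for complexes in results.values() for c in complexes}
    return sorted(pooled, key=lambda c: sorted(c))


# Toy stand-ins for wrapped predictors; they ignore the graph and
# return fixed, hypothetical predictions.
def toy_predictor_a(graph):
    return [{"P1", "P2", "P3"}]


def toy_predictor_b(graph):
    return [{"P1", "P2", "P3"}, {"P4", "P5"}]


results = run_predictors({"a": toy_predictor_a, "b": toy_predictor_b}, graph={})
pooled = pool_predictions(results)
```

A thread pool suits this setting because calls to remote services spend most of their time waiting on the network; the pooled list could then be passed to an evaluation step against a reference database.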

The remainder of the paper is organized as follows. Section 2 introduces protein complex prediction. Section 3 discusses related work. Section 4 presents the IMPRECO algorithm. Section 5 discusses the distributed architecture of IMPRECO. Section 6 presents a case study and evaluates performance of the resulting predictions with respect to those of basic predictors. Finally, Section 7 concludes the paper and outlines future work.

Protein complex prediction

The prediction of protein complexes from experimental data is still a challenge. This paper focuses on algorithmic approaches that rely on two main ideas: (i) modeling the whole set of interactions as a graph, and (ii) the use of clustering for finding complexes. The workflow for complex prediction comprises three main steps: (i) building a PPI network from binary interaction data, (ii) algorithmic analysis of the network, and (iii) result evaluation, as depicted in Fig. 2. After an algorithm

Related work

This section presents the main algorithms for protein complex prediction. They usually take as input a graph representing a PPI network and then extract complexes by using graph-clustering methods.

The Molecular Complex Detection Algorithm (MCODE) [13] applies a strategy based on three main steps. Firstly, it weights all the nodes based on their edge density, calculated for a local area called the k-core. The k-core of a graph is its (central) most densely

The IMPRECO algorithm

The IMPRECO algorithm is a meta-predictor that predicts complexes using four steps as depicted in Fig. 2: (i) generation of the PPI network, (ii) parallel execution of different predictors, (iii) integration of the obtained predictions, and (iv) result evaluation. Finally, the results are visualized and presented to the user. In the first step, IMPRECO collects data of binary interactions, merges them, and builds a corresponding graph. In the second step, it invokes the existing predictors in

The IMPRECO system

The IMPRECO system is based on a distributed Service-Oriented Architecture, as depicted in Fig. 3. The main modules of the system are: (i) an integration master service that implements the integration algorithms, (ii) a prediction master service that uses an internally parallel architecture to wrap existing predictors as web/grid services, (iii) an evaluation module that evaluates results, and (iv) a graphical user interface (GUI), also depicted in Fig. 5. The IMPRECO system is designed on top of

Case study

This section demonstrates the ability of IMPRECO to predict complexes via a case study involving publicly available data. There is currently no gold-standard dataset for benchmarking protein complex prediction algorithms, so the various approaches usually use a yeast dataset as input data and the MIPS database to compare predicted complexes with real ones. Even though MIPS is nowadays one of the most comprehensive public datasets of yeast complexes available, it is still an incomplete

Conclusions

In this work, we presented IMPRECO, a system for the prediction of protein complexes starting from protein interaction networks. IMPRECO is based on a distributed architecture built on top of a grid infrastructure. The proposed system collects data from the user and then runs different predictors in parallel. The results are finally integrated. Using IMPRECO, the user can run different complex prediction experiments in parallel. IMPRECO also provides an easy data integration

References (32)

  • Mario Cannataro et al.

    Using ontologies for preprocessing and mining spectra data on the Grid

    Future Gener. Comput. Syst.

    (2007)
  • T. Ito et al.

    A comprehensive two-hybrid analysis to explore the yeast protein interactome

    Proc. Natl. Acad. Sci. USA

    (2001)
  • P. Uetz et al.

    A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae

    Nature

    (2000)
  • A.C. Gavin et al.

    Functional organization of the yeast proteome by systematic analysis of protein complexes

    Nature

    (2002)
  • C. Alfarano et al.

    The biomolecular interaction network database and related tools 2005 update

    Nucleic Acids Res.

    (2005)
  • Lukasz Salwinski et al.

    The database of interacting proteins: 2004 update

    Nucleic Acids Res.

    (2004)
  • H.W. Mewes et al.

    MIPS: A database for genomes and protein sequences

    Nucleic Acids Res.

    (2002)
  • Annick Lesne

    Complex networks: from graph theory to biology

    Lett. Math. Phys.

    (2006)
  • D.A. Fell et al.

    The small world of metabolism

    Nat. Biotechnol.

    (2000)
  • T. Aittokallio et al.

    Graph-based methods for analysing networks in cell biology

    Brief Bioinform.

    (2006)
  • K. Lage et al.

    A human phenome-interactome network of protein complexes implicated in genetic disorders

    Nat. Biotechnol.

    (2007)
  • D. Cortez et al.

    Requirement of ATM-dependent phosphorylation of BRCA1 in the DNA damage response to double-strand breaks

    Science

    (1999)
  • S. Fields et al.

    A novel genetic system to detect protein–protein interactions

    Nature

    (1989)
  • Gary Bader et al.

    An automated method for finding molecular complexes in large protein interaction networks

    BMC Bioinform.

    (2003)
  • R. Sharan et al.

    Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data

    J. Comput. Biol.

    (2005)
  • A.D. King et al.

    Protein complex prediction via cost-based clustering

    Bioinformatics

    (2004)
    Mario Cannataro has been an Associate Professor of computer engineering at the University “Magna Græcia” of Catanzaro, Italy, since 2002. He received the Laurea Degree (cum laude) in computer engineering from the University of Calabria, Italy, in 1993. His current research interests include grid computing, bioinformatics, computational proteomics and genomics, grid-based problem solving environments, and adaptive hypermedia systems. He has published two books and more than 150 papers in international journals and conference proceedings. Mario Cannataro is a co-founder of Exeura and is a member of ACM, IEEE Computer Society, HealthGrid, and BITS (Bioinformatics Italian Society). Prof. Cannataro can be reached at [email protected].

    Pietro Hiram Guzzi is an Assistant Professor of Computer Engineering at the University “Magna Græcia” of Catanzaro, Italy, since 2008. He received his Ph.D. in Biomedical Engineering in 2008, from Magna Græcia University of Catanzaro. He received his Laurea degree in Computer Engineering in 2004 from the University of Calabria, Rende, Italy. His research interests comprise bioinformatics, the analysis of proteomics data, and the analysis of protein interaction networks. Pietro H. Guzzi can be reached at [email protected].

    Pierangelo Veltri has been an Assistant Professor of Computer Engineering at the University “Magna Græcia” of Catanzaro, Italy, since 2002. He received his Ph.D. in computer science in 2002 from the University of Orsay (Paris XI), and he was a Ph.D. student from 1998 to 2002 at INRIA Rocquencourt in the Verso-Database unit. He was also a contract professor at the University of Paris XIII from 2000 to 2002, teaching database classes. His main research interests are database systems and modeling, XML and semistructured databases, geographical and spatio-temporal databases, views, and high performance computing. Since 2002 his main activities have focused on biological data management, such as mass spectra data analysis, proteomics databases, and protein structure prediction, as well as on bioinformatics topics and the design of clinical informatics systems. Pierangelo Veltri can be reached at [email protected].