1 Introduction

Proteins interact with other proteins to accomplish biological functions. These interactions give rise to transient or more permanent complexes, and the resulting interaction networks facilitate biological processes. To study the biological function of a protein, it is essential to identify its probable interactions with other proteins. A protein-protein interaction (PPI) network can be thought of as a complex system of proteins based on the interactions among them. Wagner et al. [1] modelled the PPI network as an undirected graph in which nodes represent proteins and edges represent the interactions among them. PPI networks (PPINs) provide substantial biological knowledge about interacting proteins at the molecular level [2]. Mining these networks gives important directions for the study of biological pathways and protein function [3].

To discover clusters of protein complexes in PPINs, researchers have used many techniques, including sub-graph clustering, finding dense regions [4,5,6], and clique finding [7]. Matsuda et al. [8] introduced the maximum quasi-clique problem (MQP). A quasi-clique is a constrained association of vertices in a graph, and the constraint is controlled by a parameter \( \upgamma \), which leads to the notion of a \( \upgamma \)-quasi-clique.

Subsequently, Pie et al. [9] proposed a mining algorithm, \( \text{Crochet} \), to discover all quasi-cliques, and later upgraded it to \( \text{Crochet}^{+} \) for the same purpose [10]. These methods have some restrictions on discovering all quasi-cliques. Brunato et al. [11] proposed another definition of a quasi-clique that uses a pair of parameters. Bhattacharyya et al. [12] presented an algorithm to find the largest quasi-cliques in the PPIN of Homo sapiens.

In view of the above, there is an increasing need to localize significant clusters of proteins in an interaction network. The existing algorithms for finding quasi-cliques are not sufficient to address the complexity of many networks. In the case of Homo sapiens, a simple analysis of tightly coupled cliques in the protein-protein networks may not be sufficient for investigating key disease pathways. Therefore, we relax the constraints in an attempt to find all possible maximal quasi-cliques in the networks. The work presented here is found to be computationally more efficient than the earlier works Quick [13] and Cocain [14], and the experimental results support this claim.

In this work, we search for the largest PPI cluster using a new quasi-clique algorithm (qCliP). In the following, we first describe the important preliminaries. The method is then presented, followed by a detailed pathway analysis.

2 Methods

Here, some basic definitions and preliminaries of the maximal quasi-clique problem are described first. Throughout, a graph means an undirected, labelled, simple graph. A graph \( G \) is defined by a tuple \( \left( {V,E} \right) \), where \( V \) is the set of vertices and \( E \) is the set of edges between pairs of vertices. Our objective is to search for all possible maximal quasi-cliques in PPINs. A PPIN can be defined by a tuple \( \left( {P,I} \right) \), where \( P \) denotes the set of proteins and \( I \) denotes the set of interactions. So, there is a direct analogy between \( \left( {V,E} \right) \) and \( \left( {P,I} \right) \).
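The analogy between \( \left( {V,E} \right) \) and \( \left( {P,I} \right) \) can be made concrete with a minimal sketch that stores a PPIN as adjacency sets. The function name `build_ppin` and the protein labels are assumptions made for illustration and do not come from the data set used in this work.

```python
from collections import defaultdict

def build_ppin(interactions):
    """Build an undirected simple graph (adjacency sets) from protein pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # a simple graph has no self-loops
        adjacency[a].add(b)
        adjacency[b].add(a)  # undirected: store the edge in both directions
    return dict(adjacency)

# Hypothetical protein identifiers, used only for illustration.
interactions = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4"), ("P2", "P1")]
ppin = build_ppin(interactions)

proteins = set(ppin)  # P, playing the role of V
n_interactions = sum(len(nb) for nb in ppin.values()) // 2  # |I|, playing the role of |E|
print(sorted(proteins), n_interactions)
```

Using sets for the neighbours collapses repeated pairs such as ("P1", "P2") and ("P2", "P1") into a single undirected edge, which matches the simple-graph assumption.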

\( \gamma \)-quasi-clique graph: For \( 0 < \gamma \le 1 \), if every vertex of the graph \( G \) has degree at least \( \left\lceil {\gamma \times \left( {\left| V \right| - 1} \right)} \right\rceil \), then \( G \) is called a \( \gamma \)-quasi-clique graph.
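The degree condition in this definition can be checked directly. The following is a minimal sketch, assuming the graph is stored as a dictionary of adjacency sets; the function names and the small tolerance used to absorb floating-point noise are choices made here for illustration.

```python
import math

def min_degree_required(gamma, n_vertices):
    """Degree threshold ceil(gamma * (|V| - 1)) from the definition."""
    # The small tolerance guards against floating-point noise in the product.
    return math.ceil(gamma * (n_vertices - 1) - 1e-9)

def is_gamma_quasi_clique(adjacency, vertices, gamma):
    """Check whether the subgraph induced by `vertices` is a gamma-quasi-clique."""
    if not 0 < gamma <= 1:
        raise ValueError("gamma must lie in (0, 1]")
    vertices = set(vertices)
    threshold = min_degree_required(gamma, len(vertices))
    # Degrees are counted inside the induced subgraph only.
    return all(len(adjacency.get(v, set()) & vertices) >= threshold for v in vertices)

# Toy graph with hypothetical labels: the 4-cycle a-b-c-d-a.
adjacency = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(is_gamma_quasi_clique(adjacency, "abcd", 0.6))  # threshold is ceil(1.8) = 2
print(is_gamma_quasi_clique(adjacency, "abcd", 1.0))  # threshold is 3, i.e. a clique
```

With \( \gamma = 1 \) the condition reduces to the ordinary clique condition, so the 4-cycle passes at \( \gamma = 0.6 \) but fails at \( \gamma = 1 \).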

An algorithm to find the largest quasi-clique is presented in the following. The proposed method finds quasi-cliques in large protein-protein interaction networks.

figure a
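For comparison, the largest \( \gamma \)-quasi-clique of a very small graph can be found by exhaustive search. The sketch below is not the proposed qCliP algorithm; it is a brute-force baseline written for illustration, with exponential cost, and its function names, toy graph and default minimum size of 3 are assumptions.

```python
import math
from itertools import combinations

def is_quasi_clique(adjacency, vertices, gamma):
    """Degree condition of the gamma-quasi-clique definition on an induced subgraph."""
    vertices = set(vertices)
    threshold = math.ceil(gamma * (len(vertices) - 1) - 1e-9)
    return all(len(adjacency[v] & vertices) >= threshold for v in vertices)

def largest_quasi_clique_bruteforce(adjacency, gamma, min_size=3):
    """Return one largest vertex set that satisfies the degree condition."""
    nodes = sorted(adjacency)
    for size in range(len(nodes), min_size - 1, -1):  # try the biggest sets first
        for subset in combinations(nodes, size):
            if is_quasi_clique(adjacency, subset, gamma):
                return set(subset)
    return set()

# Toy graph with hypothetical labels: a triangle a-b-c with a tail c-d-e.
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
print(sorted(largest_quasi_clique_bruteforce(adjacency, 0.3)))  # ['a', 'b', 'c', 'd']
print(sorted(largest_quasi_clique_bruteforce(adjacency, 1.0)))  # ['a', 'b', 'c']
```

Lowering \( \gamma \) relaxes the degree requirement, so the search admits the larger set {a, b, c, d} at \( \gamma = 0.3 \), while \( \gamma = 1 \) returns only the triangle.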

3 Results and Discussion

In this work, we started with 1831 distinct human proteins involved in 2252 interactions (dip20100614) [15]. We then filtered the data so that each entry has a valid UniProt id, a complete primary sequence annotation, and 3D information (PDB id). After this filtering, the data set was reduced to 1007 interactions among 857 distinct proteins [16]. This data set is used to evaluate the performance of the proposed algorithm, to identify all possible maximal quasi-cliques in the PPIN, and for the subsequent pathway analyses.
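A filtering step of this kind might look like the following sketch. It assumes that an interaction is kept only when both of its proteins carry all three annotations; the record layout, the field names (`uniprot_id`, `sequence`, `pdb_id`) and the sample records are hypothetical and do not reflect the actual file format.

```python
def filter_interactions(interactions, annotations):
    """Keep interactions whose two proteins both have complete annotations."""
    required = ("uniprot_id", "sequence", "pdb_id")  # hypothetical field names

    def is_complete(protein):
        record = annotations.get(protein, {})
        return all(record.get(field) for field in required)

    kept = [(a, b) for a, b in interactions if is_complete(a) and is_complete(b)]
    proteins = {p for pair in kept for p in pair}
    return kept, proteins

# Made-up records for illustration; these are not real database entries.
annotations = {
    "A": {"uniprot_id": "U1", "sequence": "MKT", "pdb_id": "1ABC"},
    "B": {"uniprot_id": "U2", "sequence": "MLS", "pdb_id": "2DEF"},
    "C": {"uniprot_id": "U3", "sequence": "MQV", "pdb_id": None},  # no 3D structure
}
interactions = [("A", "B"), ("B", "C"), ("A", "C")]

kept, proteins = filter_interactions(interactions, annotations)
print(len(kept), len(proteins))  # 1 interaction among 2 proteins
```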

We first executed the proposed algorithm for nine values of \( \upgamma \), from 0.9 down to 0.1 in steps of 0.1. In this experiment, the largest quasi-clique had cardinality 23 and was obtained at \( \upgamma = 0.1 \). In the second stage, we executed the algorithm for finer \( \upgamma \) values in the range (0, 0.1) in steps of 0.01. The algorithm mined the largest quasi-clique, of size 29, at \( \upgamma = 0.07 \). At \( \upgamma = 0.07 \), we obtained 49 different maximal quasi-cliques with sizes ranging from 3 to 29.
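To see what these \( \upgamma \) values mean in terms of the definition, the two sweeps and the implied degree thresholds \( \left\lceil {\upgamma \times \left( {\left| V \right| - 1} \right)} \right\rceil \) can be computed as below. This is a small arithmetic sketch; the variable and function names are chosen here for illustration.

```python
import math

def degree_threshold(gamma, size):
    """Minimum within-cluster degree, ceil(gamma * (size - 1))."""
    return math.ceil(gamma * (size - 1) - 1e-9)  # tolerance absorbs float noise

# Coarse sweep 0.9, 0.8, ..., 0.1, then fine sweep 0.01, ..., 0.09 inside (0, 0.1).
coarse = [round(k / 10, 1) for k in range(9, 0, -1)]
fine = [round(k / 100, 2) for k in range(1, 10)]
print(len(coarse), len(fine))  # 9 9

# Thresholds implied by the reported cluster sizes.
print(degree_threshold(0.1, 23))   # ceil(0.1 * 22) = 3
print(degree_threshold(0.07, 29))  # ceil(0.07 * 28) = 2
```

Under the definition, each protein in the size-29 cluster found at \( \upgamma = 0.07 \) therefore needs at least 2 interaction partners inside the cluster, and each protein in the size-23 cluster found at \( \upgamma = 0.1 \) needs at least 3.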

For pathway analysis of the clustered proteins obtained by the proposed algorithm, we used the DAVID web server (http://david.abcc.ncifcrf.gov/tools.jsp). We first converted the protein ids from UniProt ids to gene ids and then analysed the corresponding genes with KEGG pathway analysis [17]. Our algorithm identifies 46 clusters for KEGG pathways. Table 1 shows the KEGG pathways of some quasi-cliques along with their respective p-values.

Table 1. KEGG Pathway for some quasi-cliques obtained from our proposed algorithm

We compared the performance of the proposed algorithm with prior work in this domain. The execution time for finding non-redundant, disjoint maximal quasi-cliques is taken as one of the key criteria. Two other algorithms, viz., Quick [13] and Cocain [14], are compared with our method on a uniform hardware platform: a computer with an Intel 2.4 GHz CPU and 2 GB of internal memory. As discussed above, the proposed algorithm performs best at \( \upgamma = 0.07 \). For the comparison with other algorithms, however, we considered a wide spectrum of \( \upgamma \) values within the range (0, 1), namely 0.01, 0.07, 0.1, 0.5 and 0.9. The detailed results of all methods for all considered \( \upgamma \) values are given in Table 2. Our algorithm identifies the largest maximal quasi-clique, of cardinality 29, when \( \upgamma \) is 0.07. Overall, our method identifies 49 mutually exclusive maximal quasi-cliques from the PPIN, and the experiment completes in 352 s.
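The two evaluation criteria used here, mutual exclusivity of the reported clusters and execution time, can be checked with a short harness such as the sketch below. The helper names and the sample clusters are hypothetical; this is not the benchmarking code behind Table 2.

```python
import time
from itertools import combinations

def are_mutually_exclusive(clusters):
    """True if no protein appears in more than one cluster."""
    return all(set(a).isdisjoint(b) for a, b in combinations(clusters, 2))

def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Hypothetical cluster outputs, for illustration only.
disjoint = [{"P1", "P2", "P3"}, {"P4", "P5", "P6"}]
overlapping = [{"P1", "P2", "P3"}, {"P3", "P4", "P5"}]

ok, seconds = timed(are_mutually_exclusive, disjoint)
print(ok, are_mutually_exclusive(overlapping))  # True False
```

Wrapping each algorithm call in the same `timed` helper on the same machine is one simple way to keep timing comparisons uniform across methods.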

Table 2. Comparative performance analysis of all four methods for different values of γ. Note that DA identifies only the maximum quasi-clique and Cocain works only for γ in the range (0.5, 1). Quick executes faster but extracts many overlapping quasi-cliques from the PPIN.

The Cocain algorithm works only for \( \upgamma \) values in the range (0.5, 1). Although it takes the least execution time in this range, its major limitation is that it does not produce significant quasi-cliques for the PPIN. Most of its quasi-cliques have cardinality 2 or 3, and the total number of quasi-cliques it generates runs into the thousands within its allowable \( \upgamma \) range. A detailed comparative analysis of these methods is given in Table 2.

4 Conclusion

In the current work, a new algorithm is presented to find all possible non-overlapping, non-redundant maximal quasi-cliques, as well as the largest quasi-clique, in large PPI networks. We applied this approach to a large PPIN of interacting human proteins and found important clusters of proteins, which we analysed using KEGG pathway analysis. We also compared the performance of the algorithm with other available work in this domain. The results indicate that this algorithm performs better than the other algorithms at identifying non-overlapping, non-redundant maximal quasi-cliques. Its performance may improve when it is applied to a more robust network, where the results would also be biologically more significant.