A Quasi-Clique Mining Algorithm for Analysis of the Human Protein-Protein Interaction Network

Sriwastava, Brijesh Kumar; Basu, Subhadip; Maulik, Ujjwal

doi:10.1007/978-3-319-69900-4_52

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 10597))

Included in the following conference series:

International Conference on Pattern Recognition and Machine Intelligence

2914 Accesses
1 Citations

Abstract

The fundamental of complete interaction system of all living cell is protein- protein interactions (PPI). A protein-protein interactions network (PPIN) can be viewed as an intricate system of proteins. The proteins are linked by interactions between themselves. In this work, we developed a new algorithm to find largest quasi-cliques in human PPIN. We also identify significant clusters of proteins for subsequent pathway analysis. In the current experimental setup, we have mined 49 quasi-cliques from the human PPIN, with the largest quasi-clique having size 29. Each of these protein clusters are analysed with KEGG pathway analysis. The algorithm has been compared with the state-of-the art available in this field. We observe that our method is better than other methods available in this domain and finds larger quasi-cliques with higher size.

You have full access to this open access chapter, Download conference paper PDF

Complex Prediction in Large PPI Networks Using Expansion and Stripe of Core Cliques

Article 28 October 2022

Developing an Efficient Clique-Based Algorithm for Community Detection in Large Graphs

A framework combines supervised learning and dense subgraphs discovery to predict protein complexes

Article 30 October 2021

Keywords

1 Introduction

Proteins interact with other proteins to accomplish biological functions. The transient or more permanent complexes are formed due to such interactions. These interaction networks facilitate biological processes. To study its biological function, it is essential to recognize its probable interaction with other proteins. A PPI network can be thought as a complex system of proteins base on interactions between themselves. Wagner et al. [1] have demonstrated the PPI network as an undirected graph. The nodes are used to represent proteins and edges are used to represent the interaction among proteins. A substantial biological knowledge at the molecular level of interacting proteins can be obtained by PPINs [2]. Important directions for study of biological pathways and protein function can be obtained by mining these networks [3].

In order to discover the cluster of protein complexes in PPINs, lots of techniques have been used by researchers. These include clustering of sub-graph, finding dense regions [4,5,6], or clique finding [7]. The concept of maximum quasi-clique problem (MQP) is introduced by Matsuda et al. [8]. It is a constrained association of vertices in a graph. It leads to a \( \upgamma \)-quasi-clique.

Subsequently, Pie et al. [9] have proposed a mining algorithm (\( {\text{Crochet}} \)) to discover all quasi-cliques. Further, they have upgraded their algorithm and proposed \( Crochet^{ + } \) for the same purpose [10]. These methods have some restrictions for discovering all quasi-cliques. Brunato et al. [11] have proposed another definition for quasi-clique with a pair of parameters. Bhattacharyya et al. [12] have presented an algorithm to find the biggest quasi-cliques in PPIN of Homo sapiens.

In view of above facts, it is evident that there is an increasing necessity to localize those significant clusters of protein in an interaction network. The existing algorithms for determination of quasi-cliques are not sufficient to address the complexity of many networks. In case of homo-sapiens, simple analysis of tightly coupled cliques from the protein-protein networks may not be sufficient for investigation of key disease pathways. Therefore, we have tried to relax the constraints in an attempt to find all possible maximal quasi-cliques in the networks. The work presented here is found to be computationally efficient from the earlier works Quick [13] and Cocain [14] and the experimental results validates our claims.

In this work, we attempt to search for the largest PPI cluster using a new quasi-clique algorithm (qCliP). In the following, we first described the important preliminaries. After that method is presented along with detailed pathway analysis.

2 Methods

Here, some basic definitions and preliminaries of Maximal Quasi-clique Problem is first described. The graph indicate undirected labelled simple graph. A graph \( G \) is defined by tuples \( \left( {V,E} \right) \), where \( V \) is set of vertices and E is set of edges in between the pair of vertices. Our objective is to search for all possible maximal quasi-cliques in PPINs. A PPIN can be defined by tuples \( \left( {P,I } \right) \), where \( {\text{P}} \) denotes protein set and \( I \) denotes set of interactions. So, we have direct analogy between \( \left( {V, E} \right) \) and \( \left( {P,I} \right) \).

\( \gamma \) -quasi-clique graph: For \( \left( {0 < \gamma \le 1} \right) \), if each vertices of the graph \( G \) has at least degree \( \left\lceil {\gamma \times \left( {\left| V \right| - 1} \right)} \right\rceil \) then such graph \( G \) is called \( \gamma \)-quasi-clique graph.

An algorithm to find the largest quasi-clique is presented in the following. The proposed method finds quasi-cliques in large protein-protein interaction networks.

3 Results and Discussion

In this work, we have first started with 1831 distinct human proteins involving 2252 interactions (dip20100614) [15]. Then we have filtered the given data so that each entry has a valid Uniprot id, complete primary sequence annotation and 3D information (PDB id). So after this filtration the data size reduced to 1007 interactions with 857 distinct proteins [16]. This database is used for performance evaluation of the proposed algorithm, to identify all possible maximal quasi-cliques from this PPIN, and subsequent pathway analyses.

We have first executed our proposed algorithm for nine different values of \( \upgamma \) starting from 0.9 to 0.1 with an interval of 0.1. In this experiment, we observed that proposed method provides largest quasi-clique with cardinality of 23 for \( \upgamma \) is 0.1. In the second stage, we executed the novel algorithm for finer \( \upgamma \) values in the range (0, 0.1) with an interval of 0.01. We finally observed that algorithm mines largest quasi-clique of size 29 when \( \upgamma \) is 0.07. Considering the value of the \( \upgamma = 0.07 \), we got 49 different maximal quasi-cliques with size ranging from 3 to 29.

For pathway analysis of the clustered proteins, obtained by our proposed algorithm, we have used the web server (http://david.abcc.ncifcrf.gov/tools.jsp), where we first converted proteins ids from Uniprot id to gene id and then analysed all the corresponding genes on KEGG pathway analysis [17]. Our algorithm identifies 46 clusters for KEGG pathway. The Table 1 shows the KEGG pathway of some quasi-cliques along with their respective p-values.

Table 1. KEGG Pathway for some quasi-cliques obtained from our proposed algorithm

Full size table

We have compared the performance of the proposed algorithm with the available prior works in this domain. For evaluation of the system performances, the execution time for finding non-redundant disjoint maximal quasi-cliques is considered as one of the key criteria. Two other algorithms, viz., Quick [13] and Cocain [14], are compared with our method on an uniform hardware platform with Intel 2.4 GHz CPU computer having 2 GB internal memory. As discussed before, the performance of our proposed algorithm is optimised for \( \upgamma \) = 0.07. But during comparison of the method with other algorithms, we have considered a wide spectrum of \( \upgamma \) values as 0.01, 0.07, 0.1, 0.5 and 0.9, within the range (0, 1). The detailed results of all methods for all considered \( \upgamma \) values are given in the Table 2. It has been found that our novel algorithm identifies the largest maximal quasi-cliques of cardinality 29 particularly when \( \upgamma \) is 0.07. Overall, our method identifies 49 mutually-exclusive, maximal quasi-cliques from the PPIN and the experiment is completed in 352 s.

Table 2. Comparative performance analysis of all four methods for different Gamma (γ) values is shown. Please note that DA identifies only the maximum quasi-clique and Cocain works only with the γ range (0.5, 1). Quick executes faster but extracts many overlapping quasi-cliques from the PPIN.

Full size table

The Cocain algorithm works only for \( \upgamma \) values in the range (0.5, 1). Though it takes minimum time for execution in such a specific range, but the major limitation is that the method is not producing significant quasi-cliques for the PPIN. Most of the quasi-cliques are of cardinality 2 or 3 and the total number of quasi-cliques generated by Cocain method ranges in thousands within its allowable \( \upgamma \) range. A detailed comparative analysis of the above mentioned methods is given in the Table 2.

4 Conclusion

In the current work, a new algorithm is presented to find all possible non overlapping non redundant maximal quasi-cliques and also the largest size quasi-clique in the huge PPI networks. Subsequently, we apply this approach on huge PPIN and find important clusters of proteins. We have analysed these protein clusters based KEGG pathway analysis. In this work, we have attempted to cluster interactive human proteins within the PPIN using the developed algorithm and also compared its performance with other available works in this domain. It may be observed that this algorithm is better than other algorithm for identifying non-overlapping, non-redundant maximal quasi-cliques. The performance of the algorithm can be improved when applied over more robust network and also it will be biologically more important.

References

Wagner, A.: How the global structure of protein interaction networks evolves. Proc. Roy. Soc. London Ser. B Biol. Sci. 270, 457–466 (2003)
Google Scholar
Du, D., Pardalos, P.M.: Handbook of Combinatorial Optimization. Springer, New York (1998). doi:10.1007/978-1-4613-0303-9
Bomze, I.M., Budinich, M., Pardalos, P.M., et al.: The maximum clique problem. In: Handbook of Combinatorial Optimization, vol. 4, no. 1, pp. 1–74 (1999)
Google Scholar
Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., et al.: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinform. 7(1), 207 (2006)
Article Google Scholar
Brohee, S., Van Helden, J.: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinform. 7(1), 488 (2006)
Article Google Scholar
Pereira-Leal, J.B., Enright, A.J., Ouzounis, C.A.: Detection of functional modules from protein interaction networks. Proteins Struct. Funct. Bioinform. 54(1), 49–57 (2003)
Article Google Scholar
Spirin, V., Mirny, L.A.: Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. U.S.A. 100(21), 12123–12128 (2003)
Article Google Scholar
Matsuda, H., Ishihara, T., Hashimoto, A.: Classifying molecular sequences using a linkage graph with their pairwise similarities. Theoret. Comput. Sci. 210(2), 305–325 (1999)
Article MATH MathSciNet Google Scholar
Pei, J., Jiang, D., Zhang, A.: On mining cross-graph quasi-cliques. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 228–238 (2005)
Google Scholar
Jiang, D., Pei, J.: Mining frequent cross-graph quasi-cliques. ACM Trans. Knowl. Discovery Data (TKDD) 2(4), 16 (2009)
Google Scholar
Brunato, M., Hoos, Holger H., Battiti, R.: On effectively finding maximal quasi-cliques in graphs. In: Maniezzo, V., Battiti, R., Watson, J.-P. (eds.) LION 2007. LNCS, vol. 5313, pp. 41–55. Springer, Heidelberg (2008). doi:10.1007/978-3-540-92695-5_4
Chapter Google Scholar
Bhattacharyya, M., Bandyopadhyay, S.: Mining the largest quasi-clique in human protein interactome. In: International Conference on Adaptive and Intelligent Systems, ICAIS 2009, pp. 194–199 (2009)
Google Scholar
Liu, G., Wong, L.: Effective pruning techniques for mining quasi-cliques. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS, vol. 5212, pp. 33–49. Springer, Heidelberg (2008). doi:10.1007/978-3-540-87481-2_3
Chapter Google Scholar
Zeng, Z., Wang, J., Zhou, L., et al.: Coherent closed quasi-clique discovery from large dense graph databases. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–802 (2006)
Google Scholar
Salwinski, L., Miller, C.S., Smith, A.J., et al.: The Database of Interacting Proteins: 2004 Update. Nucleic Acids Res. 32, D449–D451 (2004)
Article Google Scholar
Sriwastava, B.K., Basu, S., Maulik, U., et al.: PPIcons: identification of protein-protein interaction sites in selected organisms. J. Mol. Model. 19(9), 4059–4070 (2013)
Article Google Scholar
Huang, D.W., Sherman, B.T., Lempicki, R.A.: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4(1), 44–57 (2008)
Article Google Scholar

Download references

Acknowledgement

This work is partially supported by the CMATER research laboratory of the Computer Science and Engineering Department, Jadavpur University, India, PURSE-II and UPE-II project and Research Award (F.30-31/2016(SA-II)) from UGC, Government of India.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Government College of Engineering and Leather Technology, Kolkata, 700098, India
Brijesh Kumar Sriwastava
Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
Subhadip Basu & Ujjwal Maulik

Authors

Brijesh Kumar Sriwastava
View author publications
You can also search for this author in PubMed Google Scholar
Subhadip Basu
View author publications
You can also search for this author in PubMed Google Scholar
Ujjwal Maulik
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Subhadip Basu .

Editor information

Editors and Affiliations

Indian Statistical Institute, Kolkata, India
B. Uma Shankar
Indian Statistical Institute, Kolkata, India
Kuntal Ghosh
Indian Statistical Institute, Kolkata, India
Deba Prasad Mandal
Indian Statistical Institute, Kolkata, India
Shubhra Sankar Ray
The Hong Kong Polytechnic University, Hong Kong, China
David Zhang
Indian Statistical Institute, Kolkata, India
Sankar K. Pal

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Sriwastava, B.K., Basu, S., Maulik, U. (2017). A Quasi-Clique Mining Algorithm for Analysis of the Human Protein-Protein Interaction Network. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_52

Download citation

DOI: https://doi.org/10.1007/978-3-319-69900-4_52
Published: 01 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Association for Pattern Recognition (opens in a new tab)