Research Article
Genome-wide predicting disease-related protein complexes by walking on the heterogeneous network based on data integration and laplacian normalization

https://doi.org/10.1016/j.compbiolchem.2017.04.007

Highlights

  • A novel method for predicting disease-related protein complexes is proposed, based on the modular nature of human genetic diseases and the framework of random walk with restart (RWR).

  • We combine data integration with Laplacian normalization to strengthen the weights between seed nodes of the network.

  • In comparisons with popular disease-related protein complex identification methods, our method shows reasonable performance.

Abstract

Background

Associating protein complexes with human inherited diseases is critical for a better understanding of the biological processes and functional mechanisms of disease. Many protein complexes have been identified and functionally annotated by computational and purification methods; however, the particular roles they play in causing disease have not yet been well determined.

Results

In this study, we present a novel method to identify associations between protein complexes and diseases. First, we construct a disease-protein heterogeneous network based on data integration and Laplacian normalization. Second, we apply a random walk with restart on heterogeneous network (RWRH) algorithm to this network to quantify the strength of the association between each protein and the query disease. Third, we sum the scores of the member proteins to obtain a summary score for each candidate protein complex, and then rank all candidate protein complexes by their scores. In a series of leave-one-out cross-validation experiments, we found that our method not only achieves high performance but is also robust to the parameters and the network structure. We tested our approach on breast cancer and selected the 20 top-ranked protein complexes; 17 of them are supported by evidence of a connection with breast cancer.
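The third step above, scoring a complex by summing the scores of its member proteins, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `score_complexes`, the protein and complex identifiers, and the score values are all hypothetical, and proteins absent from the score table are assumed to contribute zero.

```python
def score_complexes(protein_scores, complexes):
    """Sum member-protein scores per complex and rank complexes, highest first.

    protein_scores: dict mapping protein id -> association score (e.g. from a random walk).
    complexes: dict mapping complex id -> list of member protein ids.
    Proteins missing from protein_scores contribute 0.0 (an assumption of this sketch).
    """
    totals = {
        name: sum(protein_scores.get(protein, 0.0) for protein in members)
        for name, members in complexes.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data, not taken from the study.
protein_scores = {"P1": 0.40, "P2": 0.25, "P3": 0.20, "P4": 0.10, "P5": 0.05}
complexes = {
    "C1": ["P1", "P2"],
    "C2": ["P3", "P4", "P5"],
    "C3": ["P2", "P6"],  # P6 has no score and counts as zero
}
ranking = score_complexes(protein_scores, complexes)
ranked_names = [name for name, _ in ranking]
```

Because the scores are non-negative and simply summed, larger complexes can accumulate higher totals than small ones with equally relevant members; the text above describes a plain sum, so the sketch applies no size correction.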

Conclusions

Our proposed method, which is based on data integration and Laplacian normalization, is effective in identifying disease-related protein complexes.

Introduction

Protein-protein interactions play key roles in cellular functions. Proteins linked by non-covalent interactions form protein complexes that correspond to specific biological functions (Le et al., 2013). These protein complexes and their cellular functions have been identified by a series of methods based on protein interaction networks (Li et al., 2010, Mukhopadhyay et al., 2012, Chen et al., 2013, Peng et al., 2015a) and on affinity purification-mass spectrometry experiments (Choi, 2012, Cai et al., 2012). For example, Peng et al. identified protein complexes using a weighted PageRank-Nibble algorithm and the core-attachment structure (Peng et al., 2015b), and Zhao et al. detected protein complexes based on an uncertain graph model (Zhao et al., 2014). However, the particular roles these complexes play in causing disease have not yet been well determined. In CORUM (Ruepp et al., 2010), one of the most current databases of protein complexes, all protein complexes have been functionally annotated and categorized, but only a few of them carry comments on their relationship with diseases.

Protein complexes not only help us make better sense of cellular functions, but also give us insight into human diseases (Fraser and Plotkin, 2007). Protein complexes have been shown, both experimentally and computationally, to be associated with a large number of diseases. For example, the PAR4/BACE1 complex is involved in the pathogenesis of Alzheimer disease (Xie and Guo, 2005); the BRCA1/CtIP/ZBRK1 complex plays a central role in breast cancer, where impairment of the BRCA1/CtIP/ZBRK1 repressor complex on the ANG1 promoter accelerates the growth of mammary tumors (Furuta et al., 2006); and mTOR complex 1 is associated with hematopoiesis and Pten-loss-evoked leukemogenesis (Kalaitzidis et al., 2012). Therefore, identifying the protein complexes underlying a query disease can shed light on the biological processes and functional mechanisms of the disease under investigation, and thus contribute to the diagnosis and treatment of human inherited diseases (Jacquemin and Jiang, 2013).

Phenotypically similar diseases are often caused by genes that are part of the same functional module, such as a protein complex or a biological pathway. This concept is referred to as the modular nature of human genetic diseases (Kayarkar et al., 2009, Lage et al., 2007, Oti and Brunner, 2007). Some early studies made use of protein complexes to predict novel disease genes. For example, Lage et al. prioritized disease proteins through a systematic analysis of human protein complexes comprising gene products related to many different categories of human disease (Lage et al., 2007). Vanunu et al. proposed a global network-based method to prioritize disease proteins and infer disease-protein complex associations (Vanunu et al., 2010). Yang et al. detected and prioritized disease genes based on a novel protein complex network (Yang et al., 2011). In all these studies, however, the protein complexes are used to infer disease-related genes rather than disease-related complexes; only a few recent studies have focused directly on the latter problem. For example, Jacquemin et al. built a three-layered heterogeneous network and then ran a network propagation algorithm on it to discover disease-associated protein complexes (Jacquemin and Jiang, 2013). Le et al. first constructed a protein complex network in which two protein complexes are connected by shared genes, and then applied the random walk with restart (RWR) algorithm to that network to rank candidate protein complexes by their relative importance to known disease protein complexes (Le et al., 2013). Chen et al. constructed a disease-protein heterogeneous network and then applied a maximum information flow algorithm to prioritize disease-related protein complexes (Chen et al., 2014).

In this paper, based on the modular nature of human genetic diseases and the framework of RWR, we develop a novel computational method to infer potential disease-protein complex associations. First, we construct a weighted gene–gene matrix and a weighted phenotype–phenotype matrix by integrating a known gene-phenotype interaction matrix with a protein–protein interaction matrix and a phenotype–phenotype similarity matrix, and then merge them into one large weighted matrix for a heterogeneous network. Laplacian normalization is used to normalize the gene matrix and the phenotype matrix before the heterogeneous network is constructed, as well as the transition matrices of the heterogeneous network. Second, we apply an RWRH algorithm to this network to quantify the strength of the associations between proteins and the query disease. Third, we sum the scores of the member proteins to obtain a score for each candidate protein complex, and then rank all candidate protein complexes by their scores. The performance of this method is assessed by a series of large-scale leave-one-out cross-validation experiments. The results show that our method not only achieves high performance but is also robust to the parameters involved and to the network structure. Moreover, in a case study on breast cancer, 17 of the top 20 protein complexes are shown to be associated with breast cancer. This indicates that our method is suitable for identifying disease-protein complex associations.
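The first two steps can be illustrated with a small NumPy sketch. It is written under stated assumptions and is not the authors' implementation: Laplacian normalization is taken to be the common symmetric form D^{-1/2} A D^{-1/2}, the transition matrix follows a widely used RWRH formulation with jumping probability λ, restart probability γ, and seed weighting η, and the data-integration weighting is omitted because this excerpt does not specify it. The function names and the toy matrices are hypothetical.

```python
import numpy as np


def laplacian_normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; rows of isolated nodes stay zero."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]


def row_normalize(M, scale):
    """Rescale each row of M to sum to scale[i]; all-zero rows stay zero."""
    sums = M.sum(axis=1)
    out = np.zeros_like(M, dtype=float)
    nonzero = sums > 0
    out[nonzero] = M[nonzero] * (scale[nonzero] / sums[nonzero])[:, None]
    return out


def rwrh(G, P, B, gene_seeds, pheno_seeds, gamma=0.7, lam=0.5, eta=0.5,
         tol=1e-10, max_iter=1000):
    """Random walk with restart on a gene-phenotype heterogeneous network.

    G: gene-gene weights, P: phenotype-phenotype weights, B: gene-phenotype links.
    Assumes at least one seed on each side (seed vectors must not sum to zero).
    """
    B = np.asarray(B, dtype=float)
    n_g, n_p = B.shape
    G, P = laplacian_normalize(G), laplacian_normalize(P)
    gene_linked = B.sum(axis=1) > 0   # genes with at least one phenotype link
    pheno_linked = B.sum(axis=0) > 0  # phenotypes with at least one gene link
    # Linked nodes jump across networks with probability lam and stay with 1 - lam.
    M = np.block([
        [row_normalize(G, np.where(gene_linked, 1.0 - lam, 1.0)),
         row_normalize(B, np.full(n_g, lam))],
        [row_normalize(B.T, np.full(n_p, lam)),
         row_normalize(P, np.where(pheno_linked, 1.0 - lam, 1.0))],
    ])
    p0 = np.concatenate([(1.0 - eta) * gene_seeds / gene_seeds.sum(),
                         eta * pheno_seeds / pheno_seeds.sum()])
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - gamma) * (M.T @ p) + gamma * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p[:n_g], p[n_g:]


# Hypothetical toy network: 4 genes on a path, 3 phenotypes, 3 known links.
G = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
P = np.array([[0.0, 0.8, 0.2], [0.8, 0.0, 0.5], [0.2, 0.5, 0.0]])
B = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1]], dtype=float)
gene_scores, pheno_scores = rwrh(G, P, B,
                                 gene_seeds=np.array([1.0, 0.0, 0.0, 0.0]),
                                 pheno_seeds=np.array([1.0, 0.0, 0.0]))
```

The resulting `gene_scores` play the role of the per-protein association strengths that the third step sums over the members of each candidate complex. The default values of `gamma`, `lam`, and `eta` mirror the parameter names used in the text, but how the paper defines each one precisely is an assumption here.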

Section snippets

Data source

The protein–protein interaction (PPI) data were derived from the Human Protein Reference Database (HPRD) (Keshava Prasad et al., 2009) and comprise 9998 proteins and 41049 interactions. The disease phenotype similarity scores among 5080 diseases were obtained from the literature (van Driel et al., 2006). Each similarity score lies between 0 and 1, and a larger value indicates higher phenotypic similarity between a disease pair. Phenotype-protein associations were obtained from OMIM (

Performance evaluation

Leave-one-out cross-validation was used to evaluate the performance of this method, and the results are shown in Fig. 1. For simplicity, we set γ = 0.7, λ = η = 0.5, and w = 0.9; the effect of the parameters is discussed in the next section. We observed that 295 (21.32%) test cases were ranked first, 444 (32.08%) were ranked among the top 5, 512 (36.99%) were ranked among the top 10, and 582 (42.05%) were ranked among the top 20 (Fig. 1(a)), suggesting a faster accumulation of top
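The reported top-k percentages can be checked with a few lines of arithmetic. The sketch below rests on an assumption: the total number of test cases is not stated in this excerpt, and the value 1384 is inferred here because it reproduces all four reported percentages after rounding. The helper names and the example rank list are hypothetical.

```python
def top_k_fraction(ranks, k):
    """Fraction of held-out cases whose true item is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def percent(count, total):
    """Percentage of `count` out of `total`, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Counts reported in the text for the top 1, 5, 10, and 20 positions.
reported_counts = {1: 295, 5: 444, 10: 512, 20: 582}

# Assumed total number of leave-one-out test cases (inferred, not stated in the excerpt).
ASSUMED_TOTAL = 1384

recomputed = {k: percent(count, ASSUMED_TOTAL) for k, count in reported_counts.items()}

# Hypothetical ranks from five leave-one-out runs, to show top_k_fraction in use.
example_ranks = [1, 3, 7, 15, 40]
example_top5 = top_k_fraction(example_ranks, 5)
```

In a leave-one-out run, each known association is removed in turn, the method is rerun, and the rank of the held-out item is recorded; `top_k_fraction` then summarizes those ranks in the same form as the counts above.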

Conclusions

In this study, we developed a novel computational method to predict disease-related protein complexes based on the modular nature of human genetic diseases and the framework of RWR. To enhance the modular feature of the network for better performance, we combined data integration with Laplacian normalization to strengthen the weights between seed nodes by fully exploiting the topological features of the existing network. The cross-validation and case study demonstrated

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant no. 61572180) and the Hunan Provincial Natural Science Foundation of China (Grant no. 13JJ2017).

References (38)

  • X. Chen et al. A novel candidate disease genes prioritization method based on module partition and rank fusion. OMICS (2010)

  • X. Chen et al. Drug-target interaction prediction by random walk on the heterogeneous network. Mol. Biosyst. (2012)

  • B. Chen et al. Identifying protein complexes in protein–protein interaction networks by using clique seeds and graph entropy. Proteomics (2013)

  • Y. Chen et al. Prioritizing protein complexes implicated in human diseases by network optimization. BMC Syst. Biol. (2014)

  • H. Choi. Computational detection of protein complexes in AP-MS experiments. Proteomics (2012)

  • A.D. D’Andrea et al. The Fanconi anaemia/BRCA pathway. Nat. Rev. Cancer (2003)

  • H.B. Fraser et al. Using protein complexes to predict phenotypic effects of gene mutation. Genome Biol. (2007)

  • T. Jacquemin et al. Walking on a tissue-specific disease-protein-complex heterogeneous network for the discovery of disease-related protein complexes. BioMed Res. Int. (2013)

  • N.A. Kayarkar et al. Protein networks in diseases. Int. J. Drug Discov. (2009)