Genome-wide predicting disease-related protein complexes by walking on the heterogeneous network based on data integration and laplacian normalization

doi:10.1016/j.compbiolchem.2017.04.007

Computational Biology and Chemistry

Volume 69, August 2017, Pages 41-47

https://doi.org/10.1016/j.compbiolchem.2017.04.007

Highlights

• A novel method for predicting disease-related protein complexes is proposed, based on the modular nature of human genetic disease and the framework of RWR.
• Data integration is combined with Laplacian normalization to strengthen the weights between seed nodes of the network.
• Compared with several popular methods for identifying disease-related protein complexes, our method shows reasonable performance in the experiments.

Abstract

Background

Associating protein complexes with human inherited diseases is critical for a better understanding of the biological processes and functional mechanisms of disease. Many protein complexes have been identified and functionally annotated by computational and purification methods; however, the particular roles they play in causing disease have not yet been well determined.

Results

In this study, we present a novel method to identify associations between protein complexes and diseases. First, we construct a disease-protein heterogeneous network based on data integration and Laplacian normalization. Second, we apply a random walk with restart on heterogeneous network (RWRH) algorithm to this network to quantify the strength of the association between proteins and the query disease. Third, we sum the scores of the member proteins to obtain a summary score for each candidate protein complex, and then rank all candidate protein complexes by their scores. In a series of leave-one-out cross-validation experiments, we found that our method not only achieves high performance but is also robust to the parameters and the network structure. We tested our approach on breast cancer and selected the 20 top-ranked protein complexes; 17 of them are supported by evidence of a connection with breast cancer.

Conclusions

Our proposed method, which is based on data integration and Laplacian normalization, is effective in identifying disease-related protein complexes.

Introduction

Protein-protein interactions play key roles in cellular functions. Proteins linked by non-covalent interactions form protein complexes that correspond to specific biological functions (Le et al., 2013). These protein complexes and their cellular functions have been identified by a range of methods based on protein interaction networks (Li et al., 2010, Mukhopadhyay et al., 2012, Chen et al., 2013, Peng et al., 2015a) and on affinity purification-mass spectrometry experiments (Choi, 2012, Cai et al., 2012). For example, Peng et al. identified protein complexes using a weighted PageRank-Nibble algorithm and core-attachment structure (Peng et al., 2015b). Zhao et al. detected protein complexes based on an uncertain graph model (Zhao et al., 2014). However, the particular roles these complexes play in causing disease have not yet been well determined. In CORUM (Ruepp et al., 2010), a current database of protein complexes, all protein complexes have been functionally annotated and categorized, but only a few of them carry comments on their relationship with diseases.

Protein complexes not only help us make better sense of cellular functions but also give us insight into human diseases (Fraser and Plotkin, 2007). Protein complexes have been shown, both experimentally and computationally, to be associated with a large number of diseases. For example, the PAR4/BACE1 complex is involved in the pathogenesis of Alzheimer disease (Xie and Guo, 2005); the BRCA1/CtIP/ZBRK1 complex plays a central role in breast cancer, where impairment of the BRCA1/CtIP/ZBRK1 repressor complex on the ANG1 promoter accelerates the growth of mammary tumors (Furuta et al., 2006); and mTOR complex 1 is associated with hematopoiesis and Pten-loss-evoked leukemogenesis (Kalaitzidis et al., 2012). Therefore, identifying the protein complexes underlying a query disease can shed light on the biological processes and functional mechanisms of the disease under investigation, and thus contribute to the diagnosis and treatment of human inherited diseases (Jacquemin and Jiang, 2013).

Phenotypically similar diseases are often caused by genes that belong to the same functional module, such as a protein complex or biological pathway; this concept is referred to as the modular nature of human genetic diseases (Kayarkar et al., 2009, Lage et al., 2007, Oti and Brunner, 2007). Some early studies made use of protein complexes to predict novel disease genes. For example, Lage et al. prioritized disease proteins through a systematic analysis of human protein complexes comprising gene products related to many different categories of human disease (Lage et al., 2007). Vanunu et al. proposed a global network-based method to prioritize disease proteins and infer disease-protein complex associations (Vanunu et al., 2010). Yang et al. detected and prioritized disease genes based on a novel protein complex network (Yang et al., 2011). In all these studies, however, protein complexes are used to infer disease-related genes rather than disease-related complexes; only a few recent studies have directly addressed the latter problem. For example, Jacquemin et al. built a three-layered heterogeneous network and then performed a network propagation algorithm on it to discover disease-associated protein complexes (Jacquemin and Jiang, 2013). Le et al. first constructed a protein complex network in which two protein complexes are connected if they share genes, and then applied the random walk with restart (RWR) algorithm to that network to rank candidate protein complexes by their relative importance to known disease protein complexes (Le et al., 2013). Chen et al. constructed a disease-protein heterogeneous network and then performed a maximum information flow algorithm to prioritize disease-related protein complexes (Chen et al., 2014).
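To make the shared-gene idea attributed to Le et al. concrete, the following minimal Python sketch builds a small complex-complex network and runs a plain RWR over it. It is an illustration under stated assumptions, not the published implementation: the complex and gene names, the choice of the shared-gene count as edge weight, and the restart value of 0.5 are all hypothetical.

```python
from itertools import combinations

def complex_network(complexes):
    """Weight between two complexes = number of genes they share (0 = no edge)."""
    names = sorted(complexes)
    w = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        w[a][b] = w[b][a] = float(len(complexes[a] & complexes[b]))
    return names, w

def rwr(names, w, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart; the walker restarts uniformly over the seeds."""
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in names}
    out = {n: sum(w[n].values()) for n in names}  # total outgoing weight
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for j in names:
            inflow = sum(p[i] * w[i][j] / out[i] for i in names if out[i] > 0)
            nxt[j] = (1 - restart) * inflow + restart * p0[j]
        if sum(abs(nxt[n] - p[n]) for n in names) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical complexes and gene identifiers, for illustration only.
complexes = {
    "cx_A": {"g1", "g2", "g3"},
    "cx_B": {"g3", "g4"},
    "cx_C": {"g4", "g5"},
    "cx_D": {"g6"},  # shares no gene with the others
}
names, w = complex_network(complexes)
scores = rwr(names, w, seeds={"cx_A"})  # cx_A plays the known disease complex
ranking = sorted(names, key=scores.get, reverse=True)
```

In this toy network the seed complex ranks first, complexes closer to it in the shared-gene graph rank higher, and the isolated complex receives a score of zero, which is the behaviour the ranking relies on.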

In this paper, based on the modular nature of human genetic disease and the framework of RWR, a novel computational method is developed to infer potential disease-protein complex associations. First, we construct a weighted gene–gene matrix and a weighted phenotype–phenotype matrix by integrating a known gene-phenotype interaction matrix with a protein–protein interaction matrix and a phenotype–phenotype similarity matrix, and then merge them into one large weighted matrix for a heterogeneous network. Laplacian normalization is applied to the gene matrix and the phenotype matrix before the heterogeneous network is constructed, and also to the transition matrices of the heterogeneous network. Second, we apply an RWRH algorithm to this network to quantify the strength of the associations between proteins and the query disease. Third, we sum the scores of the member proteins to obtain a score for each candidate protein complex, and then rank all candidate protein complexes by their scores. The performance of this method is assessed by a series of large-scale leave-one-out cross-validation experiments. The results show that our method not only achieves high performance but is also robust to the parameters involved and to the network structure. Moreover, a case study on breast cancer is performed, in which 17 of the top 20 protein complexes are shown to be associated with breast cancer. This indicates that our method is suitable for identifying disease-protein complex associations.
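The three steps can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation, and it rests on several assumptions that the text here does not confirm: Laplacian normalization is taken to be the symmetric scaling a_ij / sqrt(d_i d_j); the transition matrix follows the commonly used RWRH layout with γ as the restart probability, λ as the inter-network jump probability, and η as the weight of the phenotype seed; the data-integration step and the parameter w are omitted; and the normalization is applied only to the two input matrices. All matrices and complex names below are toy values.

```python
import math

def laplacian_normalize(a):
    """Scale a[i][j] by 1/sqrt(d_i * d_j), where d holds the row sums."""
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
             for j in range(len(a))] for i in range(len(a))]

def scaled_row(row, total_prob):
    """Rescale a row so that it sums to total_prob (all zeros if the row is empty)."""
    s = sum(row)
    return [total_prob * x / s if s > 0 else 0.0 for x in row]

def transition_matrix(ag, ap, b, lam):
    """Row-stochastic matrix over [proteins, phenotypes]; lam = jump probability."""
    bt = [list(col) for col in zip(*b)]  # phenotype-by-protein associations
    rows = []
    for i, row in enumerate(ag):
        jump = lam if sum(b[i]) > 0 else 0.0  # no jump without an association
        rows.append(scaled_row(row, 1 - jump) + scaled_row(b[i], jump))
    for i, row in enumerate(ap):
        jump = lam if sum(bt[i]) > 0 else 0.0
        rows.append(scaled_row(bt[i], jump) + scaled_row(row, 1 - jump))
    return rows

def rwrh(m, p0, gamma, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - gamma) * M^T p + gamma * p0 until the L1 change < tol."""
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [(1 - gamma) * sum(m[i][j] * p[i] for i in range(n)) + gamma * p0[j]
               for j in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy data (entirely hypothetical): 4 proteins, 2 phenotypes.
ppi = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
pheno_sim = [[0.0, 0.6], [0.6, 0.0]]
assoc = [[1, 0], [0, 0], [0, 1], [0, 0]]  # protein 0 ~ phenotype 0, protein 2 ~ phenotype 1

gamma, lam, eta = 0.7, 0.5, 0.5
m = transition_matrix(laplacian_normalize(ppi), laplacian_normalize(pheno_sim), assoc, lam)
# Query phenotype 0: the seeds are its known protein (0) and the phenotype itself.
p0 = [(1 - eta) * x for x in [1, 0, 0, 0]] + [eta * x for x in [1, 0]]
scores = rwrh(m, p0, gamma)

# Complex score = sum of the steady-state scores of its member proteins.
complexes = {"complex_A": [0, 1], "complex_B": [2, 3]}
complex_scores = {name: sum(scores[i] for i in members)
                  for name, members in complexes.items()}
ranking = sorted(complex_scores, key=complex_scores.get, reverse=True)
```

Because every row of the transition matrix sums to one, the score vector remains a probability distribution, so complex scores are directly comparable within a query. Summing member scores does favour larger complexes, which is a property of the scoring rule as described rather than of this sketch.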

Data source

The protein–protein interaction (PPI) data were derived from the Human Protein Reference Database (HPRD) (Keshava Prasad et al., 2009) and include 9998 proteins and 41049 interactions. The disease phenotype similarity scores among 5080 diseases were obtained from the literature (van Driel et al., 2006). Each similarity score lies between 0 and 1, with a larger value indicating higher phenotypic similarity between a disease pair. Phenotype-protein associations were obtained from OMIM (

Performance evaluation

Leave-one-out cross-validation was used to evaluate the performance of this method, and the results are shown in Fig. 1. For simplicity, we chose γ = 0.7, λ = η = 0.5, w = 0.9. The effect of the parameters is discussed in the next section. We observed that 295 (21.32%) test cases were ranked first, 444 (32.08%) were ranked among the top 5, 512 (36.99%) among the top 10, and 582 (42.05%) among the top 20 (Fig. 1(a)), suggesting a faster accumulation of top
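The top-k figures above are cumulative counts over the ranks obtained by the held-out cases. The short Python sketch below shows how such a summary can be computed from a list of ranks, and checks that the reported counts and percentages are mutually consistent. The total of 1384 test cases is inferred here from the arithmetic (it is the only integer for which 295 cases round to 21.32%) and is not quoted from the text; the function name and the sample ranks are hypothetical.

```python
def top_k_summary(ranks, cutoffs=(1, 5, 10, 20)):
    """Return {k: (count, percent)} of test cases ranked within the top k."""
    n = len(ranks)
    summary = {}
    for k in cutoffs:
        hits = sum(1 for r in ranks if r <= k)
        summary[k] = (hits, round(100.0 * hits / n, 2))
    return summary

# Hypothetical ranks of five held-out cases, for illustration only.
example = top_k_summary([1, 3, 7, 15, 40])

# Consistency check of the reported figures against the inferred total.
inferred_total = 1384  # assumption derived from the percentages, not stated in the text
reported = {1: (295, 21.32), 5: (444, 32.08), 10: (512, 36.99), 20: (582, 42.05)}
recomputed = {k: round(100.0 * hits / inferred_total, 2)
              for k, (hits, _) in reported.items()}
```

In a real evaluation, each rank would come from removing one known association, rerunning the walk, and recording the position of the held-out complex among the candidates.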

Conclusions

In this study, we developed a novel computational method to predict disease-related protein complexes based on the modular nature of human genetic disease and the framework of RWR. To enhance the modular feature of the network for better performance, we combined data integration with Laplacian normalization to strengthen the weights between seed nodes by fully exploiting the topological features of the existing network. The cross-validation and case study demonstrated

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant no. 61572180) and the Hunan Provincial Natural Science Foundation of China (Grant no. 13JJ2017).

References (38)

S. Furuta et al.
Removal of BRCA1/CtIP/ZBRK1 repressor complex on ANG1 promoter leads to accelerated mammary tumor growth contributed by prominent vasculature
Cancer Cell
(2006)
D. Kalaitzidis et al.
mTOR complex 1 plays critical roles in hematopoiesis and Pten-loss-evoked leukemogenesis
Cell Stem Cell
(2012)
Z. Kleibl et al.
Women at high risk of breast cancer: molecular characteristics, clinical presentation and management
Breast
(2016)
S. Kohler et al.
Walking the interactome for prioritization of candidate disease genes
Am. J. Hum. Genet.
(2008)
J.E. Ladbury et al.
Noise in cellular signaling pathways: causes and effects
Trends Biochem. Sci.
(2012)
Duc-Hau Le et al.
Towards the identification of disease associated protein complexes
Procedia Comput. Sci.
(2013)
J. Xie et al.
PAR-4 is involved in regulation of beta-secretase cleavage of the Alzheimer amyloid precursor protein
J. Biol. Chem.
(2005)
Z.Q. Zhao et al.
Laplacian normalization and random walk on heterogeneous networks for disease-gene prioritization
Comput. Biol. Chem.
(2015)
J.S. Amberger et al.
OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders
Nucleic Acids Res.
(2015)
B. Cai et al.
Detection of protein complexes from affinity purification/mass spectrometry data
BMC Syst. Biol.
(2012)
X. Chen et al.
A novel candidate disease genes prioritization method based on module partition and rank fusion
OMICS
(2010)
X. Chen et al.
Drug-target interaction prediction by random walk on the heterogeneous network
Mol. Biosyst.
(2012)
B. Chen et al.
Identifying protein complexes in protein–protein interaction networks by using clique seeds and graph entropy
Proteomics
(2013)
Y. Chen et al.
Prioritizing protein complexes implicated in human diseases by network optimization
BMC Syst. Biol.
(2014)
H. Choi
Computational detection of protein complexes in AP-MS experiments
Proteomics
(2012)
A.D. D’Andrea et al.
The Fanconi anaemia/BRCA pathway
Nat. Rev. Cancer
(2003)
H.B. Fraser et al.
Using protein complexes to predict phenotypic effects of gene mutation
Genome Biol.
(2007)
T. Jacquemin et al.
Walking on a tissue-specific disease-protein-complex heterogeneous network for the discovery of disease-related protein complexes
BioMed Res. Int.
(2013)
N.A. Kayarkar et al.
Protein networks in diseases
Int. J. Drug Discov.
(2009)
