Building and analysis of protein-protein interactions related to diabetes mellitus using support vector machine, biomedical text mining and network analysis

doi:10.1016/j.compbiolchem.2016.09.011

Computational Biology and Chemistry

Volume 65, December 2016, Pages 37-44

https://doi.org/10.1016/j.compbiolchem.2016.09.011

Highlights

• New protein fingerprints for capturing the topological properties of protein complexes in a linear format.
• An SVM-based predictive model for discriminating diabetes versus non-diabetes complexes with an AUC of 0.78.
• Model tested on an external data set derived from text mining a large number of PubMed abstracts.
• Network modeling to identify new disease targets.

Abstract

In order to understand the molecular mechanism underlying any disease, knowledge about the interacting proteins in the disease pathway is essential. The number of known protein-protein interactions (PPIs) is still very limited compared to the available protein sequences of different organisms. Although experiment-based high-throughput technologies provide some data about these interactions, the data are often fairly noisy. Computational techniques for predicting protein–protein interactions therefore assume significance. A set of 1296 binary fingerprints that encode a combination of structural and geometric properties was developed using the crystallographic data of 15,000 protein complexes in the PDB server. In a case study, these fingerprints were created for proteins implicated in Type 2 diabetes mellitus. The fingerprints were input into an SVM-based model for discriminating disease proteins from non-disease proteins, yielding a classification accuracy of 78.2% (AUC value of 0.78) on an external data set composed of proteins retrieved via text mining of diabetes-related literature. A PPI network was constructed and analysed to explore new disease targets. The integrated approach exemplified here has potential for identifying disease-related proteins, functional annotation and other proteomics studies.

Introduction

Diabetes is a chronic disease that occurs when the pancreas does not produce enough insulin, or when the body cannot effectively use the insulin it produces. This leads to an increased concentration of glucose in the blood, a condition known as hyperglycaemia. There are three types of diabetes: Type 1 diabetes, Type 2 diabetes and gestational diabetes (Defronzo, 1997). According to the World Health Organisation, about 346 million people worldwide have diabetes. Type 2 diabetes is afflicting third-world countries, which are fast becoming the epicentre of this silent killer, mainly due to rampant urbanization, poor nutrition and sedentary lifestyles (Buchwald et al., 2009; Diabetes Prevention Program Research Group, 2002). In view of the rapid growth of diabetes and its critical impact worldwide, it is of major importance to devise global models that can identify novel proteins related to this disease.

Identification of protein-protein interactions (PPIs) is considered a key strategy for understanding the mechanisms of any disease (Kann, 2007). Diabetes is a multi-factorial disease in which a host of crucial interactions are still largely unknown (Virkamaki et al., 1999). Protein interactions are known to correlate with a protein’s functional properties, and protein interaction networks are frequently used to discover the potential biological role of proteins with unknown function. Both experimental and computational techniques have been used to identify PPIs in many organisms (Bader and Hogue, 2003). Experimental studies have used techniques such as yeast two-hybrid screens and co-affinity purification followed by mass spectrometry (Phizicky and Fields, 1995). These methods may be prone to error and may not be specific to proteins in all organisms. Moreover, high-throughput data from protein assays may contain a number of false positives (Botstein et al., 2000). The use of computational methods for predicting the proteins involved in PPIs has therefore intensified. The identification of proteins responsible for human diseases is one of the most challenging tasks in drug design. Computational methods, for example sequence-based methods, high-throughput database methods and combinations of both, have been used to predict protein-protein interactions (Karlin and Belshaw, 2012). Machine learning methods, including Bayesian classifiers, probabilistic decision trees, logistic regression and support vector machines, have been employed to predict PPIs by using a number of protein properties to classify the data (Pizzuti and Rombo, 2016; Oliva and Fernandez-Fuentes, 2016). However, PPI networks are generally constructed on the basis of sequence data alone.

It is known that physically interacting proteins tend to be involved in the same cellular process, and mutations in their genes may lead to similar disease phenotypes (Estavez et al., 2009). Proteins must interact physically, at least briefly, forming temporary associations, to express their biological functions in the cell (Ideker and Sharan, 2008). In the current work, we hypothesize that the probability of physical interaction between two proteins depends upon structural features derived from the 3D structure of a macromolecule. Based on knowledge gained by studying the 3D structures of 15,000 protein complexes deposited in the Brookhaven Protein Data Bank, we generated 1296 binary fingerprint-based descriptors encoding the geometric and structural attributes of a protein. We computed binary fingerprints for the proteins related to type 2 diabetes. It is envisaged that the methodology employed here can be used efficiently to distinguish between disease-related and non-disease-related query proteins.

One of the main challenges in using an SVM to predict PPIs from protein sequences is finding a suitable transformation of the protein sequence information into a fixed number of inputs for SVM training. Many past studies have exploited the physicochemical properties of proteins to predict protein-protein interactions (Mei and Zhu, 2016). However, these studies often have to deal with inputs of unequal length because protein sequences vary in length. Thus a method is proposed that converts a protein sequence into fixed-dimensional representative attributes, wherein each feature represents the relationship of an amino acid to the protein sequence of interest. The approach is schematically illustrated in Fig. 1.
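The general idea of mapping sequences of different lengths onto a fixed number of inputs can be illustrated with a minimal Python sketch. It computes a 20-dimensional amino-acid composition vector, which is a simple, well-known encoding chosen here purely for illustration; it is an assumption and not the 1296-bit fingerprint described in this work. The function name `composition_vector` and the example sequences are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code; this fixes the vector length at 20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Sequences of any length map to a vector of length 20, so the output
    can be fed to a classifier that expects a fixed number of inputs.
    Non-standard residue codes (e.g. 'X') are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard input: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two hypothetical sequences of different lengths yield vectors of the same size.
short_vec = composition_vector("MKT")
long_vec = composition_vector("MKTAYIAKQRQISFVKSHFSRQ")
print(len(short_vec), len(long_vec))  # prints: 20 20
```

A composition vector discards residue order, so richer fixed-length encodings are usually needed in practice; the sketch only shows why a fixed dimensionality removes the unequal-length problem.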

Materials and methods

An in-house Java-based program was used to generate binary fingerprints of bit length 1296 for a given protein structure. JProLine, a program developed in our research group, was employed for constructing multiple sequence alignments and generating heatmaps (Kumar et al., 2016). Another internally developed tool, the MegaMiner portal, was employed for rapid, intelligent text mining of biomedical records (Karthikeyan et al., 2015).

Comparison between the performances of different parameter sets to optimize the training set

The training set, consisting of 2653 proteins, had to be constructed using the parameter values that yielded the best accuracy. In this step, the training set is optimized with respect to two parameters, C and kernel type. The table below gives the performance of different parameter sets.

The two important parameters, namely C (the regularization constant) and kernel type, were optimized using the grid search method. The observed trend was that the performance of the training set increases with
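A grid search over C and kernel type can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the authors' implementation: the function `grid_search`, the stand-in scorer `toy_score` and the candidate values are all hypothetical, and in a real workflow the `evaluate` callback would train an SVM and return a cross-validated score.

```python
from itertools import product
from typing import Callable, Iterable


def grid_search(
    c_values: Iterable[float],
    kernels: Iterable[str],
    evaluate: Callable[[float, str], float],
) -> tuple[float, str, float]:
    """Try every (C, kernel) pair and return the best pair with its score.

    `evaluate` is a caller-supplied function that trains a model with the
    given settings and returns a validation score (higher is better).
    """
    best = None
    for c, kernel in product(c_values, kernels):
        score = evaluate(c, kernel)
        # Keep the first pair seen on ties by requiring a strict improvement.
        if best is None or score > best[2]:
            best = (c, kernel, score)
    if best is None:
        raise ValueError("the parameter grid is empty")
    return best


# Hypothetical stand-in for a cross-validated SVM score. The numbers are
# made up purely to exercise the search and are NOT results from the paper.
def toy_score(c: float, kernel: str) -> float:
    kernel_bonus = {"linear": 0.00, "rbf": 0.05}[kernel]
    return 0.70 + kernel_bonus - 0.01 * abs(c - 10.0)


best_c, best_kernel, best_score = grid_search(
    [0.1, 1.0, 10.0, 100.0], ["linear", "rbf"], toy_score
)
print(best_c, best_kernel)  # prints: 10.0 rbf
```

Passing the scorer as a callback keeps the search independent of any particular SVM library, so the same loop could wrap whichever training and validation routine is in use.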

Conclusions

Diabetes mellitus is a chronic disease that cannot be cured except in very specific situations. Exploiting protein-protein interactions can greatly increase the likelihood of finding positional candidate disease proteins for diabetes. When applied on a large scale, they can lead to novel candidate protein predictions. An integrated approach involving fingerprint generation, SVM analysis, text mining and PPI networks was used to identify disease-related proteins. The model developed in the present

Conflict of interest

The authors declare no conflict of interest.

Acknowledgement

RV thanks DST, New Delhi, India; MK thanks the Director, NCL-Pune, and CSIR, New Delhi, for the GENESIS (BSC0121) project.

References (31)

R.C. Edgar et al. Multiple sequence alignment. Curr. Opin. Struct. Biol. (2006)
B.M. Frier et al. Serum trypsin concentration and pancreatic trypsin secretion in insulin-dependent diabetes mellitus. Clin. Chim. Acta (1980)
S. Ji et al. Insulin inhibits growth hormone signaling via the growth hormone receptor/JAK2/STAT5B pathway. J. Biol. Chem. (1999)
G.D. Bader et al. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinf. (2003)
A.-L. Barabasi et al. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. (2016)
D. Botstein et al. Gene ontology: tool for the unification of biology. Nat. Genet. (2000)
H. Buchwald et al. Weight and type 2 diabetes after bariatric surgery: systematic review and meta-analysis. Am. J. Med. (2009)
C.-C. Chang et al. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) (2016)
R.A. Defronzo. Pathogenesis of type 2 diabetes: metabolic and molecular implications for identifying diabetes genes. Diabetes Care (1997)
P.A. Estavez et al. Normalized mutual information feature selection. Neural Netw. IEEE Trans. (2009)
Diabetes Prevention Program Research Group. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N. Engl. J. Med. (2002)
R. Hofmann. RapidMiner: Data Mining Use Cases and Business Analytics Applications (2016)
T. Ideker et al. Protein networks in disease. Genome Res. (2008)
T.O. Johnson et al. Protein tyrosine phosphatase 1B inhibitors for diabetes. Nat. Rev. Drug Discov. (2002)
M.G. Kann. Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief. Bioinf. (2007)
