Research Article
Building and analysis of protein-protein interactions related to diabetes mellitus using support vector machine, biomedical text mining and network analysis

https://doi.org/10.1016/j.compbiolchem.2016.09.011

Highlights

  • New protein fingerprints for capturing the topological properties of protein complexes in a linear format.

  • An SVM-based predictive model for discriminating diabetes versus non-diabetes complexes with an AUC of 0.78.

  • Model tested on an external data set derived from text mining a large number of PubMed abstracts.

  • Network modeling to identify new disease targets.

Abstract

Understanding the molecular mechanism underlying any disease requires knowledge of the interacting proteins in the disease pathway. The number of known protein-protein interactions (PPIs) is still very limited compared to the number of available protein sequences from different organisms. Experimental high-throughput technologies provide some data about these interactions, but the data are often fairly noisy. Computational techniques for predicting protein-protein interactions therefore assume significance. In this work, 1296 binary fingerprints that encode a combination of structural and geometric properties were developed using the crystallographic data of 15,000 protein complexes in the PDB. In a case study, these fingerprints were created for proteins implicated in Type 2 diabetes mellitus. The fingerprints were input into an SVM-based model for discriminating disease proteins from non-disease proteins, yielding a classification accuracy of 78.2% (AUC value of 0.78) on an external data set composed of proteins retrieved via text mining of diabetes-related literature. A PPI network was constructed and analysed to explore new disease targets. The integrated approach exemplified here has potential for identifying disease-related proteins, functional annotation and other proteomics studies.

Introduction

Diabetes is a chronic disease that occurs when the pancreas does not produce enough insulin, or when the body cannot effectively use the insulin it produces. This leads to an increased concentration of glucose in the blood, a condition known as hyperglycaemia. There are three types of diabetes: Type 1 diabetes, Type 2 diabetes and gestational diabetes (Defronzo, 1997). According to the World Health Organisation, about 346 million people worldwide are affected by diabetes. Type 2 diabetes is afflicting third world countries, which are fast becoming the epicentre of this silent killer, mainly due to rampant urbanization, poor nutrition and sedentary lifestyles (Buchwald et al., 2009, G. Diabetes Prevention Program Research, 2002). Given the rapid growth of diabetes and its critical impact worldwide, it is of major importance to devise global models that can identify novel proteins related to this disease.

Identification of protein-protein interactions (PPIs) is considered a key strategy for understanding the mechanisms of any disease (Kann, 2007). Diabetes is a multi-factorial disease in which a host of crucial interactions are still largely unknown (Virkamaki et al., 1999). Protein interactions are known to correlate with a protein's functional properties, and protein interaction networks are frequently used to discover the potential biological role of proteins of unknown function. Both experimental and computational techniques have been used to identify PPIs in many organisms (Bader and Hogue, 2003). Experimental studies to identify PPIs have been carried out using techniques such as yeast two-hybrid screens and co-affinity purification followed by mass spectrometry (Phizicky and Fields, 1995). These methods may be prone to error and may not be specific to proteins in all organisms. Moreover, high-throughput data from protein assays may contain a number of false positives (Botstein et al., 2000). The use of computational methods for predicting the proteins involved in PPIs has therefore intensified. The identification of proteins responsible for human diseases is one of the most challenging tasks in drug design. Computational methods, for example sequence-based methods, high-throughput database methods and combinations of both, have been used to predict protein-protein interactions (Karlin and Belshaw, 2012). Machine learning methods including Bayesian classifiers, probabilistic decision trees, logistic regression and support vector machines have been employed for predicting PPIs by using a number of protein properties to classify the data (Pizzuti and Rombo, 2016, Oliva and Fernandez-Fuentes, 2016). However, PPI networks are generally constructed on the basis of sequence data alone.

It is known that physically interacting proteins tend to be involved in the same cellular process, and mutations in their genes may lead to similar disease phenotypes (Estavez et al., 2009). Proteins must interact physically, at least briefly, forming temporary associations in order to express their biological functions in the cell (Ideker and Sharan, 2008). In the current work, we hypothesize that the probability of physical interaction between two proteins depends upon structural features derived from the 3D structure of the macromolecule. Based on knowledge gained by studying the 3D structures of 15,000 protein complexes deposited in the Brookhaven Protein Data Bank, we have generated 1296 binary fingerprint-based descriptors encoding the geometric and structural attributes of a protein. We computed binary fingerprints for the proteins related to type 2 diabetes. It is envisaged that the methodology employed here can be used efficiently to distinguish between disease-related and non-disease-related query proteins.
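As a rough illustration of what a fixed-length binary fingerprint looks like as a data structure, the sketch below builds a 1296-position bit vector from a set of "on" positions and serializes it as a linear string. Only the bit length of 1296 comes from the text; the function names and the example bit positions are hypothetical, and the sketch says nothing about how the authors map structural or geometric properties to individual bits.

```python
FINGERPRINT_LENGTH = 1296  # bit length stated in the text


def make_fingerprint(active_bits, length=FINGERPRINT_LENGTH):
    """Return a list of 0/1 values with the given positions set to 1."""
    bits = [0] * length
    for index in active_bits:
        if not 0 <= index < length:
            raise ValueError(f"bit index {index} is outside 0..{length - 1}")
        bits[index] = 1
    return bits


def to_linear_string(bits):
    """Serialize the bit vector as a string of '0' and '1' characters."""
    return "".join(str(bit) for bit in bits)


# Hypothetical bit positions, chosen only to illustrate the data structure.
example = make_fingerprint({0, 17, 1295})
example_string = to_linear_string(example)
```

Because every protein yields a vector of the same length, such fingerprints can be stacked row by row into a matrix and passed to a classifier without further padding or truncation.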

One of the main challenges in using an SVM for the prediction of PPIs from protein sequences is finding a suitable transformation of the protein sequence information into a fixed number of inputs to be used in SVM training. Many studies in the past have exploited the physicochemical properties of proteins to predict protein-protein interactions (Mei and Zhu, 2016). However, these studies often have to deal with inputs of unequal length because protein sequences vary in length. Thus a method is proposed that converts a protein sequence into fixed-dimensional representative attributes, wherein each feature represents the relationship of an amino acid to the protein sequence of interest. The approach is schematically illustrated in Fig. 1.
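To make the fixed-dimension idea concrete, the following minimal sketch maps sequences of any length to a 20-element vector of amino-acid fractions. Amino-acid composition is used here only as a simple stand-in; it is an assumption for illustration and not the transformation proposed in the paper, whose details are not given in this excerpt. The function name and the example sequences are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition_vector(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    The output always has 20 entries, whatever the sequence length.
    Characters outside the 20 standard codes are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two made-up sequences of different lengths give vectors of equal length.
short_vector = composition_vector("MKV")
long_vector = composition_vector("MKVLAAGIVGLLLAQ")
```

Each feature in this stand-in relates one amino acid to the whole sequence, so proteins of 3 or 3,000 residues occupy the same input space for SVM training.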


Materials and methods

An in-house Java-based program was used to generate binary fingerprints of bit length 1296 for a given protein structure. JProLine, a program developed in our research group, was employed for constructing multiple sequence alignments and generating heatmaps (Kumar et al., 2016). Another internally developed tool, the MegaMiner portal, was employed for the rapid intelligent text mining of biomedical records (Karthikeyan et al., 2015).

Comparison between the performances of different parameter sets to optimize the training set

The training set, consisting of 2653 proteins, had to be constructed using the parameter values that yielded the best accuracy. In this step, the training set is optimized with respect to two parameters, namely C and kernel type. The table below gives the performance values of the different parameter sets.
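A search over C and kernel type can be sketched as an exhaustive loop over every parameter combination, keeping the best-scoring one. The sketch below is generic and hedged: the candidate values in `PARAM_GRID` are assumptions, and `placeholder_score` is a synthetic stand-in for whatever validation accuracy an SVM would achieve with those settings. It does not reproduce any result from the paper.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best.

    param_grid maps a parameter name to a list of candidate values.
    score_fn takes a dict of parameters and returns a number (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; the values tried in the paper are not shown here.
PARAM_GRID = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "kernel": ["linear", "poly", "rbf"],
}


def placeholder_score(params):
    """Synthetic toy score surface, NOT a measured accuracy.

    In practice this would train an SVM with the given C and kernel
    and return its accuracy on held-out data.
    """
    kernel_bonus = {"linear": 0.0, "poly": 0.01, "rbf": 0.02}[params["kernel"]]
    return 0.7 - 0.001 * abs(params["C"] - 10.0) + kernel_bonus


best_params, best_score = grid_search(PARAM_GRID, placeholder_score)
```

Replacing `placeholder_score` with a function that trains and validates a real model leaves the search loop unchanged, which is the main reason to keep the scoring step separate.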

The two important parameters namely C (regularization constant) and kernel type were optimized using the grid search method. The observed trend was that the performance of the training set increases with

Conclusions

Diabetes mellitus is a chronic disease which cannot be cured except in very specific situations. Exploiting protein-protein interactions can greatly increase the likelihood of finding positional candidate disease proteins for diabetes. When applied on a large scale they can lead to novel candidate protein predictions. An integrated approach involving fingerprint generation, SVM analysis, text mining, PPI networks was used to identify disease related proteins. The model developed in the present

Conflict of interest

The authors declare no conflict of interest.

Acknowledgement

RV thanks DST, New Delhi, India. MK thanks the Director, NCL-Pune, and CSIR, New Delhi, for the GENESIS (BSC0121) project.

References (31)

  • R.C. Edgar et al.

    Multiple sequence alignment

    Curr. Opin. Struct. Biol.

    (2006)
  • B.M. Frier et al.

    Serum trypsin concentration and pancreatic trypsin secretion in insulin-dependent diabetes mellitus

    Clin. Chim. Acta

    (1980)
  • S. Ji et al.

    Insulin inhibits growth hormone signaling via the growth hormone receptor/JAK2/STAT5B pathway

    J. Biol. Chem.

    (1999)
  • G.D. Bader et al.

    An automated method for finding molecular complexes in large protein interaction networks

    BMC Bioinf.

    (2003)
  • A.-L. Barabasi et al.

    Network medicine: a network-based approach to human disease

    Nat. Rev. Genet.

    (2016)
  • D. Botstein et al.

    Gene ontology: tool for the unification of biology

    Nat. Genet.

    (2000)
  • H. Buchwald et al.

    Weight and type 2 diabetes after bariatric surgery: systematic review and meta-analysis

    Am. J. Med.

    (2009)
  • C.-C. Chang et al.

    LIBSVM: a library for support vector machines

    ACM Trans. Intell. Syst. Technol. (TIST)

    (2016)
  • R.A. Defronzo

    Pathogenesis of type 2 diabetes: metabolic and molecular implications for identifying diabetes genes

    Diabetes Care

    (1997)
  • P.A. Estavez et al.

    Normalized mutual information feature selection

    Neural Netw. IEEE Trans.

    (2009)
  • G. Diabetes Prevention Program Research

    Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin

    N. Engl. J. Med.

    (2002)
  • R. Hofmann

    RapidMiner: Data Mining Use Cases and Business Analytics Applications

    (2016)
  • T. Ideker et al.

    Protein networks in disease

    Genome Res.

    (2008)
  • T.O. Johnson et al.

    Protein tyrosine phosphatase 1B inhibitors for diabetes

    Nat. Rev. Drug Discov.

    (2002)
  • M.G. Kann

    Protein interactions and disease: computational approaches to uncover the etiology of diseases

    Brief. Bioinf.

    (2007)