Research Article
Building and analysis of protein-protein interactions related to diabetes mellitus using support vector machine, biomedical text mining and network analysis
Introduction
Diabetes is a chronic disease that occurs when the pancreas does not produce enough insulin, or when the body cannot effectively use the insulin it produces. This leads to an increased concentration of glucose in the blood, a condition known as hyperglycaemia. There are three types of diabetes: type 1 diabetes, type 2 diabetes and gestational diabetes (Defronzo, 1997). According to the World Health Organisation, about 346 million people worldwide are affected by diabetes. Type 2 diabetes is afflicting the third world countries, which are fast becoming the epicentre of this silent killer, mainly owing to rampant urbanization, poor nutrition and sedentary lifestyles (Buchwald et al., 2009; Diabetes Prevention Program Research Group, 2002). Given the rapid growth of diabetes and its critical impact worldwide, it is of major importance to devise global models that can identify novel proteins related to this disease.
Identification of protein-protein interactions (PPIs) is considered a key strategy for understanding the mechanisms of any disease (Kann, 2007). Diabetes is a multi-factorial disease in which a host of crucial interactions are still largely unknown (Virkamaki et al., 1999). Protein interactions are known to correlate with a protein’s functional properties, and protein interaction networks are frequently used to discover the potential biological role of proteins of unknown function. Both experimental and computational techniques have been used to identify PPIs in many organisms (Bader and Hogue, 2003). Experimental studies have used techniques such as yeast two-hybrid screens and co-affinity purification followed by mass spectrometry (Phizicky and Fields, 1995). These methods may be prone to error and may not be specific to proteins in all organisms. Moreover, high-throughput data from protein assays may contain a number of false positives (Botstein et al., 2000). The use of computational methods for predicting proteins in PPIs has therefore intensified. The identification of proteins responsible for human diseases is one of the most challenging tasks in drug design. Computational methods, for example sequence-based methods, high-throughput database methods and combinations of both, have been used to predict protein-protein interactions (Karlin and Belshaw, 2012). Machine learning methods, including Bayesian classifiers, probabilistic decision trees, logistic regression and support vector machines, have been employed for predicting PPIs by using a number of properties of proteins to classify the data (Pizzuti and Rombo, 2016; Oliva and Fernandez-Fuentes, 2016). However, PPI networks are generally constructed on the basis of sequence data alone.
It is known that physically interacting proteins tend to be involved in the same cellular process, and mutations in their genes may lead to similar disease phenotypes (Estavez et al., 2009). Proteins must interact physically, at least briefly, forming temporary associations in order to express their biological functions in the cell (Ideker and Sharan, 2008). In the current work, we hypothesize that the probability of physical interaction between two proteins depends upon features derived from the 3D structure of the macromolecule. Based on knowledge gained by studying the 3D structures of 15,000 protein complexes deposited in the Brookhaven Protein Databank, we have generated 1296 binary fingerprint-based descriptors encoding the geometric and structural attributes of a protein. We computed binary fingerprints for the proteins related to type 2 diabetes. It is envisaged that the methodology employed here can be used efficiently to distinguish between disease-related and non-disease-related query proteins.
One of the main challenges in using a support vector machine (SVM) to predict PPIs from protein sequences is finding a suitable transformation of the protein sequence information into a fixed number of inputs for SVM training. Many past studies have exploited the physicochemical properties of proteins to predict protein-protein interactions (Mei and Zhu, 2016). However, these studies often use inputs of unequal length because protein sequences vary in length. Thus a method is proposed that converts a protein sequence into fixed-dimensional representative attributes, wherein each feature represents the relationship of an amino acid to the protein sequence of interest. The approach is schematically illustrated in Fig. 1.
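To illustrate the general idea of mapping sequences of varying length to inputs of a fixed dimension, the following minimal sketch computes an amino acid composition vector. This is a common, simple encoding chosen here for illustration only; it is an assumption and not the descriptor set used in this work. The function name `composition_features` and the example sequences are hypothetical.

```python
# Illustrative sketch only: amino acid composition, NOT the descriptors of this study.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition_features(sequence):
    """Map a protein sequence of any length to a 20-dimensional vector.

    Each feature is the fraction of the sequence made up by one amino acid,
    so every protein yields the same number of inputs regardless of length.
    Characters outside the 20 standard residues are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            total += 1
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical sequences of different lengths give vectors of equal length.
short_vec = composition_features("MKV")
long_vec = composition_features("MKVLAAGIVGLLLAQ")
print(len(short_vec), len(long_vec))
```

Because both vectors have the same dimension, they could be passed to a classifier that requires fixed-size inputs, whatever the original sequence lengths.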
Section snippets
Materials and methods
An in-house Java-based program was used to generate binary fingerprints of bit length 1296 for a given protein structure. JProLine, a program developed in our research group, was employed for constructing multiple sequence alignments and generating heatmaps (Kumar et al., 2016). Another internally developed tool, the MegaMiner portal, was employed for rapid, intelligent text mining of biomedical records (Karthikeyan et al., 2015).
Comparison between the performances of different parameter sets to optimize the training set
The training set, consisting of 2653 proteins, had to be constructed using the parameter values that yielded the best accuracy. In this step, the training set is optimized with respect to two parameters, namely C and kernel type. Table below gives the performance values of the different parameter sets.
The two important parameters, namely C (regularization constant) and kernel type, were optimized using the grid search method. The observed trend was that the performance of the training set increases with
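A grid search over C and kernel type, as described in this section, can be sketched generically as follows. This is a minimal illustration under stated assumptions: the grid values are arbitrary, and `placeholder_score` is a dummy stand-in for a cross-validated SVM accuracy (for example, one obtained from LIBSVM); its outputs are synthetic and are not results from this study.

```python
import math
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in param_grid and return the best one.

    param_grid: dict mapping parameter name -> list of candidate values.
    evaluate:   callable taking a dict of parameters and returning a score
                (higher is better), e.g. cross-validated accuracy.
    """
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


def placeholder_score(params):
    # ASSUMPTION: dummy scoring function so the sketch runs on its own.
    # In practice this would train an SVM with the given C and kernel and
    # return its cross-validated accuracy. The numbers are synthetic.
    kernel_offset = {"linear": 0.0, "rbf": 0.1, "poly": 0.05}[params["kernel"]]
    return kernel_offset - abs(math.log10(params["C"]) - 1.0)


# Hypothetical grid; the values are illustrative, not those of the study.
grid = {"C": [0.1, 1.0, 10.0, 100.0], "kernel": ["linear", "rbf", "poly"]}
best_params, best_score, results = grid_search(grid, placeholder_score)
print(best_params, len(results))
```

Keeping the scoring function separate from the search loop means the same loop can be reused with any classifier or validation scheme.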
Conclusions
Diabetes mellitus is a chronic disease that cannot be cured except in very specific situations. Exploiting protein-protein interactions can greatly increase the likelihood of finding positional candidate disease proteins for diabetes. When applied on a large scale, they can lead to novel candidate protein predictions. An integrated approach involving fingerprint generation, SVM analysis, text mining and PPI networks was used to identify disease-related proteins. The model developed in the present
Conflict of interest
The authors declare no conflict of interest.
Acknowledgement
RV thanks DST, New Delhi, India; MK thanks the Director, NCL-Pune, and CSIR, New Delhi, for the GENESIS (BSC0121) project.
References (31)
- et al. Multiple sequence alignment. Curr. Opin. Struct. Biol. (2006)
- et al. Serum trypsin concentration and pancreatic trypsin secretion in insulin-dependent diabetes mellitus. Clin. Chim. Acta (1980)
- et al. Insulin inhibits growth hormone signaling via the growth hormone receptor/JAK2/STAT5B pathway. J. Biol. Chem. (1999)
- et al. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinf. (2003)
- et al. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. (2016)
- et al. Gene ontology: tool for the unification of biology. Nat. Genet. (2000)
- et al. Weight and type 2 diabetes after bariatric surgery: systematic review and meta-analysis. Am. J. Med. (2009)
- et al. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) (2016)
- Pathogenesis of type 2 diabetes: metabolic and molecular implications for identifying diabetes genes. Diabetes Care (1997)
- et al. Normalized mutual information feature selection. Neural Netw. IEEE Trans. (2009)
- Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N. Engl. J. Med.
- RapidMiner: Data Mining Use Cases and Business Analytics Applications
- Protein networks in disease. Genome Res.
- Protein tyrosine phosphatase 1B inhibitors for diabetes. Nat. Rev. Drug Discov.
- Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief. Bioinf.
Cited by (22)
Artificial intelligence and diabetes technology: A review
2021, Metabolism: Clinical and Experimental
Citation Excerpt: Li et al. [40] identify metabolites that mark β-cell dysfunction using regularized LR, GB, and RF. Vyas et al. [41] differentiate protein-protein interactions between subjects with and without T2D by extracting features from the three-dimensional structure of proteins and train an SVM classifier to predict protein-protein interactions. The features were obtained using biomedical text mining and protein interaction network analysis.
Clinical notes as prognostic markers of mortality associated with diabetes mellitus following critical care: A retrospective cohort analysis using machine learning and unstructured big data
2021, Computers in Biology and Medicine
Citation Excerpt: With the advent of natural language processing – a branch of artificial intelligence amenable to unstructured text data, clinical text mining is increasingly used in various domains of health [9–11]. In diabetes research, it has been used in the areas such as the analysis of protein-protein interactions [12] and early drug discovery [13]. Although text analytics encompasses a number of areas including topic modelling, sentiment analysis, association rule mining, and predictive analytics, it is still considered an evolving field [14].
Multilayer View of Pathogenic SNVs in Human Interactome through In Silico Edgetic Profiling
2018, Journal of Molecular Biology
Citation Excerpt: As a first case study, we carried out the analysis of an interaction network centered around proteins associated with T2DM. The important role of PPIs in T2DM was recently proposed [28–30]. Most of these works focused on integrating different sources of data to discover novel candidate genes for T2DM.
Prediction of post-operative survival expectancy in thoracic lung cancer surgery with soft computing
2017, Journal of Applied Biomedicine
Citation Excerpt: A variety of activation functions and different learning rules are used for this purpose (Bajpai et al., 2011). The multi layer perceptron (MLP) (Altan et al., 2016), support vector machine (SVM) (Vyas et al., 2016), radial base function (RBF) (Zhang et al., 2016), self organize map (SOM), hopfid (Markou and Singh, 2003) are some types of neural networks. We utilized feed forward neural network and back propagation (BP) learning method in this research (Fig. 1).
Comparison of the molecular interactions of 7'-carboxyalkyl apigenin derivatives with S. cerevisiae α-glucosidase
2017, Computational Biology and Chemistry