Transmembrane segments prediction and understanding using support vector machine and decision tree
Introduction
In recent years, many studies have focused on improving the accuracy of transmembrane segment prediction. Transmembrane (TM) proteins are integral membrane proteins that completely cross a biological membrane, from its external to its internal surface. TM proteins serve important functions in biological systems, for example as ion channels or receptors, and because of these essential roles in cellular function they are critical targets for drug design. However, owing to their hydrophobic properties, conventional experimental approaches such as X-ray crystallography and nuclear magnetic resonance (NMR) cannot easily be applied to determine their 3D structures. Computational and theoretical approaches have therefore become important tools for identifying the structures and functions of TM proteins.
Many significant results have been achieved in the prediction of transmembrane segments (Chen et al., 2002, Sikder and Zomaya, 2005). In spite of these results, the existing methods do not explain how a learning result is reached or why a prediction decision is made. Such explanations are important for the acceptance of machine learning technology in bioinformatics applications such as protein structure prediction. Interpreting the reasons behind prediction results is useful for guiding ‘wet experiments’, and the rules extracted for interpretation also help integrate computational intelligence with symbolic AI systems for advanced deduction.
The support vector machine (SVM) is a promising classification and regression technique proposed by Vapnik and his co-workers (Cortes and Vapnik, 1995, Vapnik, 1998). SVM, a development in statistical learning theory, has recently attracted increasing interest from researchers. It is not only well founded theoretically, but also superior in many practical applications. SVM has been successfully applied in a wide variety of domains, including handwriting recognition, object recognition, speaker identification, face detection, and text categorization (Acır and Güzeli, 2004, Cristianini and Shawe-Taylor, 2000, Min and Lee, 2005, Shin et al., 2005). It is especially important in computational biology, where it is used for pattern recognition problems including protein remote homology detection, microarray gene expression analysis, recognition of translation start sites, protein structure prediction, functional classification of promoter regions, prediction of protein–protein interactions, and peptide identification from mass spectrometry data (Noble, 2004).
In most of these cases, the performance of SVMs is similar to, or significantly better than, that of traditional machine learning approaches, including neural networks. Nevertheless, like neural networks, SVMs are black-box models: they cannot produce comprehensible models that account for their predictions. Recent research tries to extract the knowledge embedded in trained neural networks (NNs) in the form of symbolic rules in order to improve comprehensibility (Andrews et al., 1995, Tickle et al., 1998, Zhou and Jiang, 2004). These rule-extraction methods serve several purposes: to provide NNs with explanatory power, to acquire knowledge for symbolic AI systems, to explore data, to develop hybrid architectures, and to improve adequacy for data mining applications (Núñez, Angulo, & Catala, 2002). Within the general area of rule extraction from neural networks, two main approaches are distinguished: decompositional and pedagogical (Andrews et al., 1995). Decompositional techniques extract rules at the level of each individual hidden and output unit within the trained NN and aggregate these rules to form global relationships. In contrast, the pedagogical approach views the trained NN at the minimum possible level of granularity, i.e. as a single entity or ‘black box’, and focuses on finding rules that map the NN's inputs directly to its outputs. In addition to these two main categories, Andrews et al. propose a third category, labeled ‘eclectic’, for rule-extraction techniques that incorporate elements of both the decompositional and pedagogical approaches. A fourth category is the compositional approach, whose algorithms are mainly designed for extracting deterministic finite-state automata (DFA) from recurrent neural networks; a representative is the algorithm proposed by Omlin and Giles (1996).
In the case of SVMs, some researchers have started to address the issue of improving comprehensibility. Rule extraction from technology IPOs in the US stock market (Mitsdorffer et al., 2002) and the learning-based rule-extraction technique for support vector machines (Barakat & Diederich, 2004) are two examples of the pedagogical method. Barakat and his group introduced an approach that treats rule extraction as a learning task consisting of two steps. First, they used the labeled patterns from a data set to train an SVM. Second, they applied the generated model to predict the labels (classes) of a different, unlabeled extended data set. The resulting patterns were then used to train a decision tree learning system and to extract the corresponding rule sets. However, the accuracy of the decision tree may be much lower than that of the SVM, owing to the limited learning ability of the decision tree. One reason for the lower accuracy is that the rules in Barakat's approach were generated from a partial data set that had the same attributes but modified values, with labels assigned by the SVM. Núñez et al. (2002) proposed another approach to rule extraction from SVMs. First, prototype vectors are determined by the k-means algorithm. These vectors are then combined with the support vectors using geometric methods to define ellipsoids in the input space, which are later translated into if-then rules. This approach does not scale well: with a large number of patterns and overlap between different attributes, its explanation capability deteriorates.
Some researchers have begun to apply support vector machines and decision trees in bioinformatics. Krishnan and Westhead (2003) carried out a comparative study of support vector machines and decision trees for predicting the effects of single nucleotide polymorphisms on protein function. They showed that the generalization capability of the SVM is clearly a great advantage, but also that decision trees have the significant advantage of producing interpretable rules. Lin, Patel, and Duncan (2003) classified genes by name using decision trees and support vector machines, with CART (Breiman, 1993) as the decision tree algorithm. Their study showed that, although the prediction errors of both were acceptably low for production purposes, the SVM outperformed CART. There is also some research on using decision trees to produce rules for bioinformatics, such as automatic rule generation for protein annotation with the C5.0 data mining algorithm (Quinlan, 1993, Quinlan, 1996) applied to SWISS-PROT (Kretschmann et al., 2001). However, none of these studies has integrated the merits of both support vector machines and decision trees.
In this paper, a novel rule-extraction approach for understanding the prediction of transmembrane segments is presented. This approach combines SVM with decision tree into a new algorithm called SVM_DT, which proceeds in four steps. The algorithm first trains an SVM. Next, a new training set is generated by careful selection from the SVM's results. Third, this new training set is used to train a decision tree learning system and to extract the corresponding rule sets. Finally, the rules are decoded into logical rules with biological meaning according to the encoding schemes. Experiments on transmembrane segment prediction with the 165-protein low-resolution data set (Chen et al., 2002) show that SVM and SVM_DT have similar accuracy, while SVM_DT is more comprehensible. Hence, SVM_DT can be used not only for prediction but also for guiding biological experiments.
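The four steps above can be sketched with off-the-shelf components. Below is a minimal sketch assuming scikit-learn, with synthetic data standing in for the paper's PSSM-encoded transmembrane patterns; the final decoding step into biologically meaningful rules is omitted, and all names and parameters here are illustrative, not the authors' actual configuration.

```python
# Minimal SVM_DT-style sketch on synthetic data: train an SVM, relabel
# the patterns with the SVM's predictions, then fit a decision tree on
# the SVM-labeled set and print its rules.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Step 1: train the SVM on the labeled patterns.
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Step 2: build a new training set labeled by the SVM itself.
y_svm = svm.predict(X)

# Step 3: train a decision tree on the SVM-labeled data and extract rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_svm)
rules = export_text(tree, feature_names=[f"f{i}" for i in range(6)])
print(rules)
```

The key design point is that the tree is trained on the SVM's labels rather than the raw ones, so the extracted rules approximate the SVM's decision function rather than the original data.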
This paper is organized as follows. Section 2 describes SVM_DT and provides a brief introduction to support vector machines and decision trees. Section 3 presents an experiment on transmembrane segment prediction with the 165-protein low-resolution data set. Section 4 analyzes the results. Finally, Section 5 summarizes the main contributions of this paper and discusses some issues of SVM_DT that should be further investigated.
Section snippets
SVM_DT
SVM is a novel learning technique introduced in the framework of the structural risk minimization (SRM) inductive principle and the theory of VC (Vapnik–Chervonenkis) bounds. SVM has a number of interesting properties, including effective avoidance of overfitting, the ability to handle large feature spaces, and information condensing of the given data set.
The basic idea of applying SVM for solving classification problems can be stated briefly in two steps. First, SVM
Experiments
In this study, because we focused on rule extraction for understanding the prediction of transmembrane segments, we needed logical rules that have biological meaning. Four methods with different encoding schemes are used in the experiments. In the first method, the PSSM matrix is used as the encoding scheme fed into both SVM and DT (PSSM_PSSM). In the second method, the PSSM matrix is fed into SVM while the sequences are fed directly into DT (PSSM_SEQ). In the third method, the combined
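The encoding step can be illustrated with a simplified stand-in. The sketch below uses a hypothetical one-hot sliding-window encoder rather than the PSSM-based schemes actually used in the paper; window size and sequence are made-up examples.

```python
# Hypothetical sliding-window encoder: each residue is represented by a
# window of residues around it, one-hot encoded over the 20 standard
# amino acids (a simplified stand-in for PSSM-based encoding).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_windows(seq, w=2):
    """Return one feature vector per residue: the one-hot codes of the
    residues in the window [i-w, i+w], zero-padded at the sequence ends."""
    vectors = []
    for i in range(len(seq)):
        vec = []
        for j in range(i - w, i + w + 1):
            code = [0] * len(AMINO_ACIDS)
            if 0 <= j < len(seq):
                code[AMINO_ACIDS.index(seq[j])] = 1
            vec.extend(code)
        vectors.append(vec)
    return vectors

vecs = encode_windows("MGLFA", w=2)
print(len(vecs), len(vecs[0]))  # 5 residues, each with 5 * 20 features
```

Each residue thus yields a fixed-length numeric vector suitable as SVM input, while the raw sequence window itself can be fed to the decision tree.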
Result analysis
Table 2 indicates that the average prediction accuracy of the rules is 93.4 for all rules with a confidence greater than 90; at the same time, their support is 78.0 and they account for 62.6% of the rules. This means that these rules not only have high quality, but also constitute the majority of the rules obtained. The rules with confidence values from 97 to 99 have even higher support values and percentages of rule numbers, and the corresponding accuracies of those rules are also very high. These
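The per-rule statistics discussed here can be computed from simple counts. The sketch below assumes the usual definitions of confidence and support (correct matches over matches, and matches over total patterns, expressed as percentages); the counts are made-up examples, not the paper's data.

```python
# Per-rule quality statistics under the usual definitions:
#   confidence = correctly classified matches / matched patterns
#   support    = matched patterns / total patterns
def rule_stats(n_matched, n_correct, n_total):
    confidence = 100.0 * n_correct / n_matched
    support = 100.0 * n_matched / n_total
    return confidence, support

conf, supp = rule_stats(n_matched=180, n_correct=171, n_total=200)
print(round(conf, 1), round(supp, 1))  # 95.0 90.0
```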
Conclusion
In recent years, many studies have focused on the accuracy of transmembrane segment prediction using machine learning methods, and many good results have been achieved. However, these studies were not able to explain the process by which a learning result was reached or why a decision was made.
The support vector machine algorithm is a classification algorithm that provides state-of-the-art performance in a wide variety of application domains. It has shown
Acknowledgements
The authors would like to thank Professor Thorsten Joachims for making SVMlight software available and thank RuleQuest Research for providing C4.5 and a ten-day evaluation license of C5.0 software for us to use and Professor Burkhard Rost for providing the 165 low-resolution data sets. This research was supported in part by a scholarship under the State Scholarship Fund of China, and the US National Institutes of Health (NIH) under grants R01 GM34766-17S1, P20 GM065762-01A1, and the US National
References (37)
- Acır et al. (2004). Automatic recognition of sleep spindles in EEG by using artificial neural networks. Expert Systems with Applications.
- Andrews et al. (1995). A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems.
- et al. (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications.
- Hua and Sun (2001). A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach. Journal of Molecular Biology.
- Jones (1999). Protein secondary structure prediction based on position-specific scoring matrix. Journal of Molecular Biology.
- Min and Lee (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications.
- Omlin and Giles (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks.
- Shin et al. (2005). An application of support vector machines in bankruptcy prediction model. Expert Systems with Applications.
- et al. (2004). Decision tree based on data envelopment analysis for effective technology commercialization. Expert Systems with Applications.
- Barakat and Diederich (2004). Learning-based rule-extraction from support vector machine. The third conference on neuro-computing and evolving intelligence (NCEI'04).
- Breiman (1993). Classification and regression trees.
- Chen et al. (2002). Transmembrane helix predictions revisited. Protein Science.
- Cristianini and Shawe-Taylor (2000). An introduction to support vector machines and other kernel-based learning methods.
- Handwritten digit recognition using statistical and rule-based decision fusion. IEEE MELECON, May 7–9.
- Henikoff and Henikoff (1992). Amino acid substitution matrices from protein blocks. PNAS.
- Transmembrane segments prediction with support vector machine based on high performance encoding schemes. Proceedings of the 2004 IEEE symposium on computational intelligence in bioinformatics and computational biology, October 7–8, La Jolla, CA, USA.
1. GCC Distinguished Cancer Scholar.