Transmembrane segments prediction and understanding using support vector machine and decision tree

doi:10.1016/j.eswa.2005.09.045

Expert Systems with Applications

Volume 30, Issue 1, January 2006, Pages 64-72

https://doi.org/10.1016/j.eswa.2005.09.045

Abstract

In recent years, many studies have focused on improving the accuracy of transmembrane segment prediction, and significant results have been achieved. Despite these results, existing methods cannot explain how a learning result is reached or why a prediction decision is made. Such explanations are important for the acceptance of machine learning technology in bioinformatics applications such as protein structure prediction. While support vector machines (SVMs) have shown strong generalization ability in a number of application areas, including protein structure prediction, they are black box models and hard to understand. Decision trees, on the other hand, provide insightful interpretation but have lower prediction accuracy. In this paper, we present an innovative approach to rule generation for understanding the prediction of transmembrane segments that integrates the merits of both SVMs and decision trees. The approach combines SVMs with decision trees in a new algorithm called SVM_DT. Experiments on transmembrane segment prediction using a low-resolution test data set of 165 proteins show that SVM_DT is much more comprehensible than SVMs and that the test accuracy of its rules is also high. Rules with confidence values over 90% have an average prediction accuracy of 93.4%. We also found that the confidence and prediction accuracy values of the rules generated by SVM_DT are quite consistent. We believe that SVM_DT can be used not only for predicting transmembrane segments but also for understanding the predictions. The predictions and their interpretation can be used to guide biological experiments.

Introduction

In recent years, many studies have focused on improving the accuracy of transmembrane segment prediction. Transmembrane (TM) proteins are integral membrane proteins that cross completely from the external to the internal surface of a biological membrane. TM proteins serve important functions in biological systems, for example as ion channels or receptors. Because of these essential roles in cellular function, TM proteins are critical targets for drug design. However, because of their hydrophobic properties, conventional experimental approaches such as X-ray crystallography or nuclear magnetic resonance (NMR) cannot easily be applied to determine their 3D structures. Computational and theoretical approaches have therefore become important tools for identifying the structures and functions of TM proteins.

Many significant results have been achieved in the prediction of transmembrane segments (Chen et al., 2002; Sikder and Zomaya, 2005). Despite these results, existing methods do not explain how a learning result is reached or why a prediction decision is made. Such explanations are important for the acceptance of machine learning technology in bioinformatics applications such as protein structure prediction. Interpreting the reasons behind prediction results is useful for guiding ‘wet experiments’, and the extracted rules also help to integrate computational intelligence with symbolic AI systems for advanced deduction.

The support vector machine (SVM) method is a new and promising classification and regression technique proposed by Vapnik and his co-workers (Cortes and Vapnik, 1995; Vapnik, 1998). SVM, a development in statistical learning theory, has recently attracted increasing interest from researchers. It is not only well founded theoretically, but also superior in practical applications. SVM has been successfully applied to a wide variety of application domains including handwriting recognition, object recognition, speaker identification, face detection, and text categorization (Acır and Güzeli, 2004; Cristianini and Shawe-Taylor, 2000; Min and Lee, 2005; Shin et al., 2005). It is especially important for the field of computational biology, where it is used for pattern recognition problems including protein remote homology detection, microarray gene expression analysis, recognition of translation start sites, protein structure prediction, functional classification of promoter regions, prediction of protein–protein interactions, and peptide identification from mass spectrometry data (Noble, 2004).

In most of these cases, the performance of SVMs is either similar to or significantly better than that of traditional machine learning approaches, including neural networks. Nevertheless, like neural networks, SVMs are black box models. They cannot produce comprehensible models that account for their predictions. In the field of neural networks (NNs), recent research has tried to improve comprehensibility by extracting the knowledge embedded in trained networks in the form of symbolic rules (Andrews et al., 1995; Tickle et al., 1998; Zhou and Jiang, 2004). These rule extraction methods serve several purposes: to provide NNs with explanatory power, to acquire knowledge for symbolic AI systems, to explore data, to develop hybrid architectures, and to improve adequacy for data mining applications (Núñez, Angulo, & Catala, 2002). Within the general area of rule extraction from neural networks, there are two main approaches: decompositional and pedagogical (Andrews et al., 1995). Decompositional techniques extract rules at the level of each individual hidden and output unit within the trained NN and aggregate these rules to form global relationships. The pedagogical approach, by contrast, views the trained NN at the minimum possible level of granularity, i.e. as a single entity or ‘black box’. The focus is then on finding rules that map the NN's inputs directly to its outputs. In addition to these two main categories, Andrews et al. propose a third category, which they label ‘eclectic’, to accommodate rule extraction techniques that incorporate elements of both the decompositional and pedagogical approaches. A fourth category is the compositional approach. Compositional algorithms are mainly designed for extracting deterministic finite-state automata (DFA) from recurrent artificial neural networks. A representative example is the algorithm proposed by Omlin and Giles (1996).

In the case of SVM, some researchers have started to address the issue of improving comprehensibility. Rule extraction for technology IPOs in the US stock market (Mitsdorffer et al., 2002) and the learning-based rule-extraction technique for support vector machines (Barakat & Diederich, 2004) are two examples of the pedagogical method. Barakat and his group introduced an approach that handles rule extraction as a learning task consisting of two steps. First, the group used the labeled patterns from a data set to train an SVM. Second, the group applied the generated model to predict the label (class) for a different, unlabeled extended data set. The resulting patterns were then used to train a decision tree learning system and to extract the corresponding rule sets. However, the accuracy of the decision tree may be much lower than that of the SVM due to the limited learning ability of the SVM. One reason for the lower accuracy is that the rules in Barakat's approach were generated from a partial data set that had the same attributes but modified values, with labels assigned by the SVM. Núñez et al. (2002) proposed another approach for rule extraction from SVM. First, prototype vectors were determined by the k-means algorithm. Then, these vectors were combined with the support vectors using geometric methods to define ellipsoids in the input space, which were later translated into if-then rules. This approach does not scale well: with a large number of patterns and an overlap between different attributes, the explanation capability deteriorates.

Some researchers have started to apply support vector machines and decision trees in bioinformatics. Krishnan and Westhead (2003) carried out a comparative study of support vector machines and decision trees for predicting the effects of single nucleotide polymorphisms on protein function. They showed that the generalization capability of the SVM is clearly a great advantage, but also that decision trees have the significant advantage of producing interpretable rules. Lin, Patel, and Duncan (2003) classified genes by name using decision trees and support vector machines, with CART (Breiman, 1993) as the decision tree algorithm. Their results showed that, although the prediction errors of both were acceptably low for production purposes, SVM outperformed CART. There is also some research on using decision trees to produce rules for bioinformatics, such as automatic rule generation for protein annotation with the C5.0 (Quinlan, 1993; Quinlan, 1996) data mining algorithm applied to SWISS-PROT (Kretschmann et al., 2001). However, none of these studies has integrated the merits of both support vector machines and decision trees.

In this paper, a novel approach to rule extraction for understanding the prediction of transmembrane segments is presented. The approach combines SVM with decision trees in a new algorithm called SVM_DT, which proceeds in four steps. First, the algorithm trains an SVM. Second, a new training set is generated by careful selection from the results of the SVM. Third, this new training set is used to train a decision tree learning system and to extract the corresponding rule sets. Finally, the rules are decoded into logical rules with biological meaning according to the encoding schemes. The experimental results for transmembrane segment prediction on the low-resolution data set of 165 proteins (Chen et al., 2002) show that SVM_DT and SVM have similar accuracy, while SVM_DT is more comprehensible. Hence, SVM_DT can be used not only for prediction, but also for guiding biological experiments.
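The steps after SVM training can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: the trained SVM is replaced by a placeholder callable (`svm_stand_in`), the selection criterion is taken to be "keep the examples that the SVM labels correctly", a tiny Gini-based threshold tree stands in for C4.5/C5.0, and the attribute names and data are hypothetical.

```python
from collections import Counter

def select_agreeing(examples, black_box):
    """Step 2 (assumed criterion): keep examples the black box labels correctly."""
    return [(x, label) for x, label in examples if black_box(x) == label]

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, max_depth=2):
    """Step 3: a tiny threshold tree (a stand-in for C4.5/C5.0, not the real thing)."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    best = None
    for j in range(len(rows[0][0])):
        # Candidate thresholds: every observed value except the largest.
        for thr in sorted({x[j] for x, _ in rows})[:-1]:
            left = [r for r in rows if r[0][j] <= thr]
            right = [r for r in rows if r[0][j] > thr]
            score = (len(left) * gini([l for _, l in left]) +
                     len(right) * gini([l for _, l in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, thr, left, right)
    if best is None:  # no attribute has two distinct values
        return majority
    _, j, thr, left, right = best
    return (j, thr, build_tree(left, max_depth - 1), build_tree(right, max_depth - 1))

def extract_rules(tree, names, conds=()):
    """Step 4: one IF-THEN rule per leaf, written with readable attribute names."""
    if not isinstance(tree, tuple):
        return [(" AND ".join(conds) or "TRUE", tree)]
    j, thr, left, right = tree
    return (extract_rules(left, names, conds + (f"{names[j]} <= {thr}",)) +
            extract_rules(right, names, conds + (f"{names[j]} > {thr}",)))

# Hypothetical data: (hydrophobicity, charge) -> class label.
examples = [((0.9, 0.1), "TM"), ((0.8, 0.3), "TM"), ((0.7, 0.2), "TM"),
            ((0.2, 0.8), "non-TM"), ((0.1, 0.9), "non-TM"),
            ((0.3, 0.7), "non-TM"), ((0.6, 0.9), "non-TM")]

# Placeholder for the trained SVM of step 1 (an assumption, not a real SVM).
svm_stand_in = lambda x: "TM" if x[0] > 0.5 else "non-TM"

selected = select_agreeing(examples, svm_stand_in)   # drops the last example
tree = build_tree(selected)
rules = extract_rules(tree, ["hydrophobicity", "charge"])
for cond, label in rules:
    print(f"IF {cond} THEN {label}")
```

In this toy run the one example that the placeholder model misclassifies is excluded before the tree is grown, so the tree learns from the black box's consistent behaviour rather than from its errors. In practice the placeholder would be replaced by a trained SVM and the toy tree by a full decision tree learner.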

This paper is organized as follows. Section 2 describes SVM_DT and gives a brief introduction to support vector machines and decision trees. Section 3 presents an experiment on transmembrane segment prediction using the low-resolution data set of 165 proteins. Section 4 analyzes the results. Finally, Section 5 summarizes the main contributions of this paper and discusses some issues of SVM_DT that should be investigated further.

Section snippets

SVM_DT

SVM is a learning technique introduced within the framework of the structural risk minimization (SRM) inductive principle and the theory of Vapnik–Chervonenkis (VC) bounds. SVM has a number of useful properties, including effective avoidance of overfitting, the ability to handle large feature spaces, and condensation of the information in a given data set.

The basic idea of applying SVM to classification problems can be stated briefly in two steps. First, SVM ...

Experiments

Because this study focuses on rule extraction for understanding the prediction of transmembrane segments, the extracted logical rules must have biological meaning. Four methods with different encoding schemes are used in the experiments. In the first method, position-specific scoring matrix (PSSM) encodings are fed into both the SVM and the DT (PSSM_PSSM). In the second method, PSSM encodings are fed into the SVM and the sequences are fed directly into the DT (PSSM_SEQ). In the third method, the combined ...
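As an illustration of what feeding sequences directly into a decision tree might look like, the sketch below turns a protein sequence into one row of categorical attributes per residue using a sliding window. The window scheme, the half-width of 2, the padding symbol, and the `T`/`N` labels are all assumptions made for this example and are not taken from the paper.

```python
def residue_windows(sequence, labels, half_width=2, pad="-"):
    """Return one (window, label) row per residue.

    Each window is a tuple of 2 * half_width + 1 one-letter residue codes
    centred on the residue being labelled; positions beyond either end of
    the sequence are filled with `pad`.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    padded = pad * half_width + sequence + pad * half_width
    width = 2 * half_width + 1
    return [(tuple(padded[i:i + width]), label)
            for i, label in enumerate(labels)]

def describe_attribute(index, residue, half_width=2):
    """Decode a tree test on window attribute `index` into readable wording."""
    offset = index - half_width
    return f"residue at offset {offset:+d} from the centre is {residue}"

# Hypothetical fragment: T = transmembrane residue, N = non-transmembrane.
rows = residue_windows("MKLLV", "NNTTT")
for window, label in rows:
    print("".join(window), label)

print(describe_attribute(3, "L"))
```

Because each attribute is a residue at a fixed offset from the centre, a tree test on such an attribute can be read back as a statement about the sequence itself, which is the kind of decoding into biologically meaningful rules that the text describes.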

Result analysis

Table 2 indicates that the average prediction accuracy is 93.4% for all rules with a confidence greater than 90%. At the same time, their support is 78.0 and they make up 62.6% of all rules. This means that these rules are not only of high quality, but also form the majority of the rules obtained. The rules with confidence values from 97% to 99% have an even higher support value and share of the rules. The corresponding accuracies of these rules are also very high. These ...
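The relationship between a rule's confidence, its support, and its prediction accuracy can be made concrete with a short sketch. The definitions used here are the common ones and are assumptions, since the paper's exact definitions are not shown in this excerpt: support is the percentage of examples that a rule covers, confidence is the percentage of covered training examples that the rule labels correctly, and prediction accuracy is the same ratio measured on held-out test examples. The rule and the data are hypothetical.

```python
def coverage_stats(condition, predicted, examples):
    """Return (support %, confidence %) of one IF-THEN rule on labelled examples.

    support    = share of all examples that satisfy the rule's condition
    confidence = share of those covered examples whose label equals `predicted`
    """
    covered = [label for x, label in examples if condition(x)]
    if not covered:
        return 0.0, 0.0
    support = 100.0 * len(covered) / len(examples)
    confidence = 100.0 * sum(label == predicted for label in covered) / len(covered)
    return support, confidence

# Hypothetical rule: IF hydrophobicity > 0.5 THEN TM
rule = (lambda x: x[0] > 0.5, "TM")

train = [((0.9,), "TM"), ((0.8,), "TM"), ((0.7,), "TM"), ((0.6,), "TM"),
         ((0.55,), "non-TM"), ((0.4,), "non-TM"), ((0.3,), "non-TM"),
         ((0.2,), "non-TM"), ((0.1,), "non-TM"), ((0.45,), "TM")]
test = [((0.95,), "TM"), ((0.65,), "TM"), ((0.6,), "non-TM"),
        ((0.52,), "TM"), ((0.2,), "non-TM")]

train_support, train_confidence = coverage_stats(*rule, train)
_, test_accuracy = coverage_stats(*rule, test)
print(train_support, train_confidence, test_accuracy)
```

Comparing `train_confidence` with `test_accuracy` rule by rule is one way to check whether confidence is a reliable indicator of how well a rule predicts on unseen data.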

Conclusion

In recent years, many studies have focused on the accuracy of transmembrane segment prediction using machine learning methods, and many good results have been achieved. However, these studies were not able to explain how a learning result was reached or why a decision was made.

The support vector machine algorithm is a classification algorithm that provides state-of-the-art performance in a wide variety of application domains. It has shown ...

Acknowledgements

The authors thank Professor Thorsten Joachims for making the SVM^light software available, RuleQuest Research for providing C4.5 and a ten-day evaluation license for the C5.0 software, and Professor Burkhard Rost for providing the 165 low-resolution data sets. This research was supported in part by a scholarship under the State Scholarship Fund of China, and the US National Institutes of Health (NIH) under grants R01 GM34766-17S1, P20 GM065762-01A1, and the US National ...

References (37)

N. Acır et al. (2004). Automatic recognition of sleep spindles in EEG by using artificial neural networks. Expert Systems with Applications.

R. Andrews et al. (1995). A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems.

Y.H. Cho et al. (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications.

S. Hua et al. (2001). A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach. Journal of Molecular Biology.

D.T. Jones (1999). Protein secondary structure prediction based on position-specific scoring matrix. Journal of Molecular Biology.

J.H. Min et al. (2005). Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters. Expert Systems with Applications.

C.W. Omlin et al. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks.

K.S. Shin et al. (2005). An application of support vector machines in bankruptcy prediction model. Expert Systems with Applications.

S.Y. Sohn et al. (2004). Decision tree based on data envelopment analysis for effective technology commercialization. Expert Systems with Applications.

N. Barakat et al. (2004). Learning-based rule-extraction from support vector machine. The third conference on neuro-computing and evolving intelligence (NCEI'04).

L. Breiman (1993). Classification and regression trees.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge,...

C.P. Chen et al. (2002). Transmembrane helix predictions revisited. Protein Science.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning (pp. 273–297), Vol. 20. Boston, MA: Kluwer...

N. Cristianini et al. (2000). An introduction to support vector machines and other kernel-based learning methods.

D. Gorgevik et al. (2002). Handwritten digit recognition using statistical and rule-based decision fusion. IEEE MELECON, May 7–9.

S. Henikoff et al. (1992). Amino acid substitution matrices from protein blocks. PNAS.

H. Hu et al. (2004). Transmembrane segments prediction with support vector machine based on high performance encoding schemes. Proceedings of the 2004 IEEE symposium on computational intelligence in bioinformatics and computational biology, October 7–8, La Jolla, CA, USA.


¹: GCC Distinguished Cancer Scholar.
