Transmembrane segments prediction and understanding using support vector machine and decision tree

https://doi.org/10.1016/j.eswa.2005.09.045

Abstract

In recent years, many studies have focused on improving the accuracy of transmembrane segment prediction, and significant results have been achieved. Despite these results, existing methods cannot explain how a learning result is reached or why a prediction decision is made. Such explanations are important for the acceptance of machine learning technology in bioinformatics applications such as protein structure prediction. While support vector machines (SVMs) have shown strong generalization ability in a number of application areas, including protein structure prediction, they are black-box models and hard to understand. Decision trees, on the other hand, provide insightful interpretation but have lower prediction accuracy. In this paper, we present an innovative approach to rule generation for understanding the prediction of transmembrane segments by integrating the merits of both SVMs and decision trees. This approach combines SVMs with decision trees into a new algorithm called SVM_DT. Experiments on the prediction of transmembrane segments with the 165 low-resolution test data set show not only that SVM_DT is much more comprehensible than SVMs, but also that the test accuracy of its rules is high. Rules with confidence values over 90% have an average prediction accuracy of 93.4%. We also found that the confidence and prediction accuracy values of the rules generated by SVM_DT are quite consistent. We believe that SVM_DT can be used not only for predicting transmembrane segments, but also for understanding the prediction. The prediction and its interpretation can be used to guide biological experiments.

Introduction

In recent years, many studies have focused on improving the accuracy of transmembrane segment prediction. Transmembrane (TM) proteins are integral membrane proteins that completely cross a biological membrane from its external to its internal surface. TM proteins have important functions in biological systems, for example as ion channels or receptors. Because of these essential roles in cellular function, TM proteins are critical targets for drug design. However, because of their hydrophobic properties, conventional experimental approaches such as X-ray crystallography or nuclear magnetic resonance (NMR) cannot easily be applied to determine their 3D structures. Computational and theoretical approaches have therefore become important tools for identifying the structures and functions of TM proteins.

Many significant results have been achieved in the prediction of transmembrane segments (Chen et al., 2002; Sikder and Zomaya, 2005). Despite these results, existing methods do not explain how a learning result is reached or why a prediction decision is made. Such explanations are important for the acceptance of machine learning technology in bioinformatics applications such as protein structure prediction. Interpreting the reasons behind the prediction results is useful for guiding 'wet experiments', and the rules extracted for interpretation also help to integrate computational intelligence with symbolic AI systems for advanced deduction.

The support vector machine (SVM) method is a new and promising classification and regression technique proposed by Vapnik and his co-workers (Cortes and Vapnik, 1995; Vapnik, 1998). SVM, a development in statistical learning theory, has recently attracted increasing interest from researchers. It is not only well-founded theoretically, but also superior in practical applications. SVM has been successfully applied to a wide variety of application domains, including handwriting recognition, object recognition, speaker identification, face detection, and text categorization (Acır and Güzeli, 2004; Cristianini and Shawe-Taylor, 2000; Min and Lee, 2005; Shin et al., 2005). It is especially important for the field of computational biology, where it is used for pattern recognition problems including protein remote homology detection, microarray gene expression analysis, recognition of translation start sites, protein structure prediction, functional classification of promoter regions, prediction of protein–protein interactions, and peptide identification from mass spectrometry data (Noble, 2004).

In most of these cases, the performance of SVMs is either similar to or significantly better than that of traditional machine learning approaches, including neural networks. Nevertheless, like neural networks, SVMs are black-box models. They cannot produce comprehensible models that account for their predictions. In the field of neural networks (NNs), recent research has tried to improve comprehensibility by extracting the knowledge embedded in trained networks in the form of symbolic rules (Andrews et al., 1995; Tickle et al., 1998; Zhou and Jiang, 2004). These rule extraction methods serve several purposes: to provide NNs with explanatory power, to acquire knowledge for symbolic AI systems, to explore data, to develop hybrid architectures, and to improve adequacy for data mining applications (Núñez, Angulo, & Catala, 2002). Within the general area of rule-extraction from neural networks, two main approaches are presented: decompositional and pedagogical (Andrews et al., 1995). Decompositional rule extraction techniques extract rules at the level of each individual hidden and output unit within the trained NN and aggregate these rules to form global relationships. In contrast, the pedagogical approach views the trained NN at the minimum possible level of granularity, i.e. as a single entity or 'black box.' The focus is then on finding rules that map the NN's inputs directly to its outputs. In addition to these two main categories, Andrews et al. also propose a third category, which they label 'eclectic,' to accommodate rule extraction techniques that incorporate elements of both the decompositional and pedagogical approaches. The fourth category is the compositional approach. Compositional algorithms are mainly designed for extracting deterministic finite-state automata (DFA) from recurrent artificial neural networks. A representative example is the algorithm proposed by Omlin and Giles (1996).

In the case of SVMs, some researchers have started to address the issue of improving comprehensibility. Rule-extraction from technology IPOs in the US stock market (Mitsdorffer et al., 2002) and the learning-based rule-extraction technique for support vector machines (Barakat & Diederich, 2004) are two examples of the pedagogical method. Barakat and his group introduced an approach that handles rule-extraction as a learning task consisting of two steps. First, the group used the labeled patterns from a data set to train an SVM. Second, the group applied the generated model to predict the label (class) for a different, unlabeled, extended data set. The resulting patterns were then used to train a decision tree learning system and to extract the corresponding rule sets. However, the accuracy of the decision tree may be much lower than that of the SVM due to the limited learning ability of the SVM. One reason for the lower accuracy is that the rules in Barakat's approach were generated from a partial data set which had the same attributes but modified values and labels classified by the SVM. Núñez et al. (2002) proposed another approach for rule-extraction from SVMs. First, prototype vectors were determined by the k-means algorithm. Then, these vectors were combined with the support vectors using geometric methods to define ellipsoids in the input space, which were later translated to if-then rules. This approach does not scale well: with a large number of patterns and overlap between different attributes, the explanation capability deteriorates.

Some researchers have started to apply support vector machines and decision trees in bioinformatics. Krishnan and Westhead (2003) carried out a comparative study of support vector machines and decision trees for predicting the effects of single nucleotide polymorphisms on protein function. They showed that the generalization capability of the SVM is clearly a great advantage, but also that decision trees have the significant advantage of producing interpretable rules. Lin, Patel, and Duncan (2003) classified genes by name using decision trees and support vector machines, with CART (Breiman, 1993) as the decision tree algorithm. Their study showed that, although the prediction errors of both were acceptably low for production purposes, SVM outperformed CART. There is also some research on using decision trees to produce rules for bioinformatics, such as automatic rule generation for protein annotation with the C5.0 (Quinlan, 1993; Quinlan, 1996) data mining algorithm applied to SWISS-PROT (Kretschmann et al., 2001). However, none of these studies has integrated the merits of both support vector machines and decision trees.

In this paper, a novel rule-extraction approach for understanding the prediction of transmembrane segments is presented. This approach combines SVMs with decision trees into a new algorithm called SVM_DT, which proceeds in four steps. First, it trains an SVM. Second, it generates a new training set by careful selection from the results of the SVM. Third, it uses this new training set to train a decision tree learning system and to extract the corresponding rule sets. Finally, it decodes the rules into logical rules with biological meaning according to the encoding schemes. The results of experiments on transmembrane segment prediction with the 165 low-resolution data set (Chen et al., 2002) show that SVM_DT and SVM have similar accuracy, while SVM_DT is more comprehensible. Hence, SVM_DT can be used not only for prediction, but also for guiding biological experiments.
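The four steps can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the tiny linear SVM, the Gini-based tree, the selection rule in step 2 (keep only the examples whose label the trained SVM reproduces), the feature names, and the toy data are all assumptions made for illustration. The paper itself relies on SVMlight and C4.5/C5.0.

```python
# Stdlib-only sketch of an SVM -> decision-tree rule pipeline (illustrative only).

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                grad_w = [g - yi * xj / n for g, xj in zip(grad_w, xi)]
                grad_b -= yi / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def gini(labels):
    return 1.0 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))

def majority(labels):
    return max(sorted(set(labels)), key=labels.count)

def build_tree(X, y, depth=0, max_depth=3):
    """Greedy binary tree on numeric features; leaves are class labels."""
    if len(set(y)) == 1 or depth == max_depth:
        return majority(y)
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X})[:-1]:  # both sides stay non-empty
            left = [yi for x, yi in zip(X, y) if x[f] <= t]
            right = [yi for x, yi in zip(X, y) if x[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # every feature is constant, so no split is possible
        return majority(y)
    _, f, t = best
    lo = [(x, yi) for x, yi in zip(X, y) if x[f] <= t]
    hi = [(x, yi) for x, yi in zip(X, y) if x[f] > t]
    return (f, t,
            build_tree([x for x, _ in lo], [yi for _, yi in lo], depth + 1, max_depth),
            build_tree([x for x, _ in hi], [yi for _, yi in hi], depth + 1, max_depth))

def extract_rules(tree, names, conds=()):
    """Turn each root-to-leaf path into (list of conditions, predicted label)."""
    if not isinstance(tree, tuple):
        return [(list(conds), tree)]
    f, t, left, right = tree
    return (extract_rules(left, names, conds + (f"{names[f]} <= {t}",)) +
            extract_rules(right, names, conds + (f"{names[f]} > {t}",)))

def svm_dt(X, y, names):
    svm = train_linear_svm(X, y)                            # step 1: train the SVM
    kept = [(x, yi) for x, yi in zip(X, y)
            if svm_predict(svm, x) == yi]                   # step 2: assumed selection rule
    tree = build_tree([x for x, _ in kept],
                      [yi for _, yi in kept])               # step 3: train the tree
    return extract_rules(tree, names)                       # step 4: readable rules

# Toy data with hypothetical features; +1 = TM residue, -1 = non-TM residue.
names = ["hydrophobicity", "charge"]
X = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.2, 0.8], [0.1, 0.6], [0.3, 0.9]]
y = [1, 1, 1, -1, -1, -1]
rules = svm_dt(X, y, names)
for conds, label in rules:
    print("IF", " AND ".join(conds), "THEN", label)
```

Training the tree only on examples the SVM handles well is one plausible reading of "careful selection": it filters out patterns that would otherwise push the tree toward rules the SVM itself does not support. On noisy data this step would drop the disputed examples before the rules are extracted.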

This paper is organized as follows. Section 2 describes SVM_DT and gives a brief introduction to support vector machines and decision trees. Section 3 presents an experiment on transmembrane segment prediction with the 165 low-resolution data set. Section 4 analyzes the results. Finally, Section 5 summarizes the main contribution of this paper and discusses some issues of SVM_DT that should be investigated further.

Section snippets

SVM_DT

SVM represents novel learning techniques that have been introduced in the framework of the structural risk minimization (SRM) inductive principle and the theory of Vapnik–Chervonenkis (VC) bounds. SVM has a number of interesting properties, including effective avoidance of overfitting, the ability to handle large feature spaces, and information condensing of the given data set.

The basic idea of applying SVM for solving classification problems can be stated briefly in two steps. First, SVM

Experiments

Because this study focuses on rule extraction for understanding the prediction of transmembrane segments, the logical rules obtained should have biological meaning. Four methods with different encoding schemes are used in the experiments. In the first method (PSSM_PSSM), PSSM encodings are fed into both the SVM and the DT. In the second method (PSSM_SEQ), PSSM encodings are fed into the SVM and the sequences are fed directly into the DT. In the third method, the combined

Result analysis

Table 2 indicates that the average prediction accuracy is 93.4% for all rules with a confidence greater than 90%. At the same time, their support is 78.0 and they account for 62.6% of the rules. This means that these rules not only have high quality, but also make up the majority of the rules obtained. The rules with confidence values from 97 to 99 have an even higher support value and percentage of rule numbers. The corresponding accuracies of the rules are also very high. These
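To make the relationship between these quantities concrete, the following Python sketch computes support and confidence for two hypothetical rules and then averages the test accuracy of the rules above a confidence threshold. The definitions are assumptions, since the excerpt does not state them: support is taken as the fraction of examples a rule covers, and confidence as the fraction of covered training examples with the rule's predicted class. The data, rule predicates, and the 0.9 threshold are invented for illustration.

```python
# Illustrative rule statistics; the definitions below are assumptions, not the paper's.

def rule_stats(rule, examples):
    """Return (support, precision) of a rule on a list of (x, label) examples.

    support   = fraction of all examples whose x satisfies the rule's condition
    precision = fraction of covered examples whose label matches the rule's class
                (called confidence on training data, accuracy on test data)
    """
    predicate, predicted = rule
    covered = [label for x, label in examples if predicate(x)]
    if not covered:
        return 0.0, 0.0
    support = len(covered) / len(examples)
    precision = sum(label == predicted for label in covered) / len(covered)
    return support, precision

# Hypothetical single-feature rules (x could be a hydrophobicity score).
rules = [
    (lambda x: x > 0.5, "TM"),
    (lambda x: x <= 0.5, "non-TM"),
]

train = [(0.9, "TM"), (0.8, "TM"), (0.7, "TM"), (0.6, "TM"),
         (0.4, "non-TM"), (0.3, "non-TM"), (0.2, "TM"), (0.1, "non-TM")]
test = [(0.95, "TM"), (0.65, "TM"), (0.55, "non-TM"), (0.75, "TM"), (0.25, "non-TM")]

# Keep rules whose training confidence exceeds the threshold, then score them on test data.
threshold = 0.9
selected = [r for r in rules if rule_stats(r, train)[1] > threshold]
test_accuracies = [rule_stats(r, test)[1] for r in selected]
mean_accuracy = sum(test_accuracies) / len(test_accuracies)

print("rules kept:", len(selected), "of", len(rules))
print("mean test accuracy of kept rules:", mean_accuracy)
```

Comparing each rule's training confidence with its test accuracy in this way is how one would check the paper's observation that the two values are consistent.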

Conclusion

In recent years, many studies have focused on the accuracy of transmembrane segment prediction using machine learning methods, and many good results have been achieved. However, these studies were not able to explain how a learning result was reached or why a decision was made.

The support vector machine algorithm is a classification algorithm that provides state-of-the-art performance in a wide variety of application domains. It has shown

Acknowledgements

The authors would like to thank Professor Thorsten Joachims for making the SVMlight software available, RuleQuest Research for providing C4.5 and a ten-day evaluation license for C5.0, and Professor Burkhard Rost for providing the 165 low-resolution data sets. This research was supported in part by a scholarship under the State Scholarship Fund of China, and the US National Institutes of Health (NIH) under grants R01 GM34766-17S1, P20 GM065762-01A1, and the US National

References (37)

  • Breiman, L. (1993). Classification and regression trees.
  • Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge,...
  • Chen, C. P., et al. (2002). Transmembrane helix predictions revisited. Protein Science.
  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning (pp. 273–297), Vol. 20. Boston, MA: Kluwer...
  • Cristianini, N., et al. (2000). An introduction to support vector machines and other kernel-based learning methods.
  • Gorgevik, D., et al. (2002). Handwritten digit recognition using statistical and rule-based decision fusion. IEEE MELECON, May 7–9.
  • Henikoff, S., et al. (1992). Amino acid substitution matrices from protein blocks. PNAS.
  • Hu, H., et al. (2004). Transmembrane segments prediction with support vector machine based on high performance encoding schemes. Proceedings of the 2004 IEEE symposium on computational intelligence in bioinformatics and computational biology, October 7–8, La Jolla, CA, USA.

1 GCC Distinguished Cancer Scholar.