Discrimination of disease-related non-synonymous single nucleotide polymorphisms using multi-scale RBF kernel fuzzy support vector machine

https://doi.org/10.1016/j.patrec.2008.11.003

Abstract

In this paper, we develop a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM) and apply it to the identification of disease-associated non-synonymous single nucleotide polymorphisms (nsSNPs). The experimental results show that the proposed MSKFSVM outperforms the traditional SVM method.

Introduction

The support vector machine (SVM) is a comparatively new learning system developed by Cortes and Vapnik (1995). Because SVM has superior features such as avoiding over-fitting and obtaining a globally optimal solution, it has been applied to many problems in bioinformatics.

Nguyen and Rajapakse (2005) used SVM to predict protein secondary structures. Kim and Park (2004) applied it to predict protein relative solvent accessibility. SVM has also been applied to remote protein homology detection (Busuttil, 2004), protein–protein binding site prediction (Bradford and Westhead, 2005), protein domain identification (Vlahoviček, 2005), protein subcellular localization (Nair and Rost, 2005), and gene and tissue classification from microarray expression data (Brown, 2000). The applications of SVM in bioinformatics have been reviewed in (Byvatov and Schneider, 2003) and (Yang, 2004).

In humans, about 90% of sequence variants are differences in single bases of DNA, called single nucleotide polymorphisms (SNPs) (Collins, 1998). Non-synonymous SNPs (nsSNPs), which lead to an amino acid change in the protein product, are closely associated with human inherited diseases (Stenson, 2003). Given the large number of nsSNPs discovered, a major challenge is to predict which of them are potentially disease-associated. Recent studies on nsSNPs have proposed a variety of predictors for discriminating deleterious nsSNPs from neutral ones. Wang and Moult (2001) analyzed the structural properties of deleterious nsSNPs and summarized a set of empirical rules for detecting them. Ng and Henikoff (2001) developed the SIFT (Sorting Intolerant From Tolerant) method for predicting deleterious nsSNPs based on sequence conservation and position-specific scoring matrices. Sunyaev (2001) developed a method for predicting the phenotypic effects of nsSNPs using both structural and evolutionary information.

In contrast to the above empirical rule-based classifications, more and more researchers have begun to use machine-learning methods to automatically train classifiers for the identification of deleterious nsSNPs. Saunders and Baker (2002) used a classification tree to classify the data. Krishnan and Westhead (2003) applied SVM and decision tree algorithms to train predictors of the effects of nsSNPs. Bao and Cui (2005) trained and evaluated classifiers using SVM and random forest algorithms. Ye et al. (2007) found new structural and sequence attributes for predicting possible disease association of nsSNPs; coupled with SVM, these attributes improved classification accuracy greatly. Machine-learning methods applied to bioinformatics have been reviewed in (Larrañaga et al., 2006) and (Cruz and Wishart, 2006).

In this paper, we develop a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM) for discriminating disease-associated nsSNPs from neutral nsSNPs. The MSKFSVM is innovative in two aspects. Firstly, it uses a multi-scale RBF kernel. Compared with a single RBF kernel, multi-scale RBF kernels are more adaptive and more powerful for complex classification problems. The parameters of the multi-scale RBF kernel are selected using an evolution strategy. Secondly, the MSKFSVM uses a novel membership function to assign different weights to input examples in the learning of the decision surface, whereas the traditional SVM method treats all input examples with equal weight. We evaluated the MSKFSVM comprehensively using the datasets in (Bao and Cui, 2005), where all 4218 nsSNPs are split into two datasets according to the number of homologous sequences: the 4013 nsSNPs with at least 10 homologous sequences are used as training dataset DS1 (502 neutral and 3511 disease-related nsSNPs), while the remaining 205 nsSNPs are used as test dataset DS2 (30 neutral and 175 disease-related nsSNPs). Most recently, Ye et al. (2007) improved prediction accuracy by using new structural and sequence attributes. To further demonstrate the effectiveness of our method, we also tested the MSKFSVM using their dataset (2249 ‘Disease’ nsSNPs and 1189 ‘Polymorphism’ nsSNPs). The experimental results show that the proposed MSKFSVM outperforms the traditional SVM method.

The rest of the paper is organized as follows. Section 2 introduces the multi-scale RBF kernel and the evolution strategy. Section 3 briefly reviews the fuzzy support vector machine. Section 4 describes the proposed membership function in detail. The experimental results are presented in Section 5, and the conclusions are drawn in Section 6.

Section snippets

Multi-scale RBF kernel

All kernels in the literature are either dot product functions K(x, y) = K(x · y) or distance functions K(x, y) = K(‖x − y‖) (Ayat, 2001). Examples of distance functions are the exponential RBF, Gaussian RBF and multi-quadratic kernels; examples of dot product functions are the polynomial, sigmoid and linear kernels. Among these kernel functions, the Gaussian RBF is the most frequently used. However, it has only one parameter for adjusting the width of the RBF, which is not powerful enough for some…
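A multi-scale RBF kernel combines Gaussian RBFs of several widths into a single valid kernel (a convex combination of Mercer kernels is itself a Mercer kernel). The sketch below illustrates the idea; the specific `gammas` and `weights` values are illustrative placeholders, since in the paper these parameters are selected by an evolution strategy.

```python
import numpy as np

def multi_scale_rbf(X, Y, gammas=(0.1, 1.0, 10.0), weights=(0.5, 0.3, 0.2)):
    """Weighted sum of Gaussian RBF kernels at several widths:

        K(x, y) = sum_k w_k * exp(-gamma_k * ||x - y||^2)

    X: (m, d) array, Y: (p, d) array. Returns the (m, p) kernel matrix.
    """
    # Pairwise squared Euclidean distances between rows of X and rows of Y.
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    K = np.zeros((X.shape[0], Y.shape[0]))
    for w, g in zip(weights, gammas):
        K += w * np.exp(-g * sq_dists)
    return K
```

A callable like this can be passed directly as the `kernel` argument of `sklearn.svm.SVC`, which evaluates it on pairs of input matrices during training and prediction.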

Support vector machine

SVM uses hypothesis space of linear functions in a high-dimensional feature space, and it is trained with a learning algorithm based on optimization theory (Cristianini and Shawe-Taylor, 2000).

Suppose a training set S contains n labeled points (x1, y1), …, (xn, yn), where xi ∈ R^N and yi ∈ {−1, 1}, i = 1, …, n. Let Φ(x) denote the mapping from R^N to a feature space Z. We want to find the hyperplane with the maximum margin, w · z + b = 0, such that for each point (zi, yi), where zi = Φ(xi), yi(w · zi + b) ≥ 1, i = 1, …, n. When the…
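The later snippets refer to slack variables ξi in Eq. (6) and memberships si in Eq. (8). For reference, these presumably correspond to the standard soft-margin primal and its fuzzy-SVM variant, sketched here from the definitions above:

```latex
% Soft-margin SVM primal (slack variables \xi_i relax the margin constraints):
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i(w\cdot z_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.

% Fuzzy SVM: each slack is weighted by a membership s_i \in (0, 1],
% so low-membership examples contribute less to the objective:
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n} s_i\,\xi_i
\quad \text{s.t.}\quad y_i(w\cdot z_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.
```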

The proposed membership function

In order to employ FSVM, we need to define a membership for each input sample. Here we propose a novel membership function. From Eq. (6) we can see that if we increase the ξi of a misclassified sample xi, the newly learned hyperplane will tend to classify xi correctly, in order to eliminate the larger error that xi introduces to the classifier and thereby minimize Eq. (6). Correspondingly, in Eq. (8), if a larger membership si is assigned to an input sample, it will increase the…
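One practical role of such memberships, mentioned in the conclusions, is to counteract class-size imbalance (e.g. 3511 disease-related vs. 502 neutral nsSNPs in DS1). The sketch below shows a simple class-balance membership, an illustrative stand-in rather than the paper's actual definition:

```python
import numpy as np

def class_balance_memberships(y):
    """Assign each sample a membership in (0, 1] that counteracts class
    imbalance: samples from rarer classes receive larger memberships.

    NOTE: this is a hypothetical, illustrative membership function,
    not the one proposed in the paper.
    """
    y = np.asarray(y)
    n = len(y)
    s = np.empty(n, dtype=float)
    for label in np.unique(y):
        mask = (y == label)
        # The larger the class, the smaller its samples' memberships.
        s[mask] = 1.0 - mask.sum() / (2.0 * n)
    return s
```

In practice, memberships like these can be passed as `sample_weight` to `sklearn.svm.SVC.fit`, which scales the per-sample penalty C exactly as the si ξi term does in the FSVM objective.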

Evaluation methods

We use the following two measures to evaluate classification performance. Overall accuracy is defined as

overall accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

The Matthews correlation coefficient (MCC) is defined as

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC has been widely used to measure the performance of machine-learning methods in bioinformatics.
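The two measures above translate directly into code. A minimal sketch from the confusion-matrix counts, using the common convention of returning 0 when the MCC denominator is zero:

```python
import math

def overall_accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1]; unlike accuracy, it
    stays informative on imbalanced datasets such as DS1/DS2.
    Returns 0.0 when any marginal count is zero (degenerate denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

For example, a perfect classifier on the DS2 test set (175 true positives, 30 true negatives) yields an overall accuracy of 1.0 and an MCC of 1.0.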

Conclusions

In this paper, we have proposed a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM). The MSKFSVM uses a multi-scale RBF kernel, which has greater non-linear mapping ability and is more adaptive than a single RBF kernel. The MSKFSVM also uses a fuzzy membership function to eliminate the effect of the size imbalance between positive and negative samples in the datasets. Meanwhile, the membership function ensures that each example makes a different contribution to the learning of the decision surface.

References (26)

  • R. Nair et al.

    Mimicking cellular sorting improves prediction of sub-cellular localization

    J. Mol. Biol.

    (2005)
  • C.T. Saunders et al.

    Evaluation of structural and evolutionary contributions to deleterious mutation prediction

    J. Mol. Biol.

    (2002)
  • N.E. Ayat

    KMOD-A new support vector machine kernel with moderate decreasing for pattern recognition

    Proc. Doc. Anal. Recognition

    (2001)
  • L. Bao et al.

    Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information

    Bioinformatics

    (2005)
  • J.R. Bradford et al.

    Improved prediction of protein–protein binding sites using a support vector machines approach

    Bioinformatics

    (2005)
  • M.P.S. Brown

    Knowledge-based analysis of microarray gene expression data by using support vector machines

    Proc. Natl. Acad. Sci. USA

    (2000)
  • S. Busuttil

    Support vector machines with profile-based kernels for remote protein homology detection

    Genome Inform. Ser. Workshop Genome Inform.

    (2004)
  • E. Byvatov et al.

    Support vector machine applications in bioinformatics

    Appl. Bioinform.

    (2003)
  • Y.C. Collins

    A DNA polymorphism discovery resource for research on human genetic variation

    Genome Res.

    (1998)
  • N. Cristianini et al.

    An Introduction to Support Vector Machines

    (2000)
  • J.A. Cruz et al.

    Applications of machine learning in cancer prediction and prognosis

    Cancer Inform.

    (2006)
  • H. Kim et al.

    Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 2D local descriptor

    Proteins

    (2004)
  • V.G. Krishnan et al.

    A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function

    Bioinformatics

    (2003)