Discrimination of disease-related non-synonymous single nucleotide polymorphisms using multi-scale RBF kernel fuzzy support vector machine

https://doi.org/10.1016/j.patrec.2008.11.003

Abstract

In this paper, we develop a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM) and apply it to the identification of disease-associated non-synonymous single nucleotide polymorphisms (nsSNPs). The experimental results show that the proposed MSKFSVM outperforms the traditional SVM method.

Introduction

The support vector machine (SVM) is a comparatively new learning system developed by Cortes and Vapnik (1995). Because SVM has superior features such as avoiding over-fitting and obtaining a globally optimal solution, it has been applied to many problems in bioinformatics.

Nguyen and Rajapakse (2005) used SVM to predict protein secondary structures. Kim and Park (2004) applied it to predict protein relative solvent accessibility. SVM has also been applied to remote protein homology detection (Busuttil, 2004), protein–protein binding site prediction (Bradford and Westhead, 2005), protein domain identification (Vlahoviček, 2005), protein subcellular localization (Nair and Rost, 2005), and gene and tissue classification from microarray expression data (Brown, 2000). The applications of SVM in bioinformatics have been reviewed in (Byvatov and Schneider, 2003) and (Yang, 2004).

In humans, about 90% of sequence variants are differences in single bases of DNA, called single nucleotide polymorphisms (SNPs) (Collins, 1998). Non-synonymous SNPs (nsSNPs), which lead to an amino acid change in the protein product, are closely associated with human inherited diseases (Stenson, 2003). Given the large number of nsSNPs discovered, a major challenge is to predict which of them are potentially disease-associated. Recent studies on nsSNPs have proposed a variety of predictors for discriminating deleterious nsSNPs from neutral ones. Wang and Moult (2001) analyzed the structural properties of deleterious nsSNPs and summarized a set of empirical rules for detecting them. Ng and Henikoff (2001) developed the SIFT (Sorting Intolerant From Tolerant) method for predicting deleterious nsSNPs based on sequence conservation and position-specific scoring matrices. Sunyaev (2001) developed a method for predicting the phenotypic effects of nsSNPs using both structural and evolutionary information.

In contrast to the above empirical rule-based classifications, more and more researchers have begun to use machine-learning methods to automatically train classifiers for the identification of deleterious nsSNPs. Saunders and Baker (2002) used a classification tree to classify the data. Krishnan and Westhead (2003) applied SVM and decision tree algorithms to train predictors of the effects of nsSNPs. Bao and Cui (2005) trained and evaluated classifiers using SVM and random forest algorithms. Ye et al. (2007) found new structural and sequence attributes for predicting possible disease association of nsSNPs; coupled with SVM, these attributes improved classification accuracy greatly. Machine-learning methods applied to bioinformatics have been reviewed in (Larrañaga et al., 2006) and (Cruz and Wishart, 2006).

In this paper, we develop a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM) for discriminating disease-associated nsSNPs from neutral nsSNPs. The MSKFSVM is innovative in two aspects. Firstly, it uses a multi-scale RBF kernel. Compared with a single RBF kernel, multi-scale RBF kernels are more adaptive and more powerful for complex classification problems. The parameters of the multi-scale RBF kernel are selected using an evolution strategy. Secondly, the MSKFSVM uses a novel membership function to assign different weights to input examples in the learning of the decision surface, whereas the traditional SVM method treats all input examples with equal weight. We evaluated the MSKFSVM comprehensively using the datasets in (Bao and Cui, 2005), where all 4218 nsSNPs are split into two datasets according to the number of homologous sequences: the 4013 nsSNPs with at least 10 homologous sequences are used as training dataset DS1 (502 neutral and 3511 disease-related nsSNPs), while the remaining 205 nsSNPs are used as test dataset DS2 (30 neutral and 175 disease-related nsSNPs). Most recently, Ye et al. (2007) improved prediction accuracy by using new structural and sequence attributes. To further demonstrate the effectiveness of our method, we also tested the MSKFSVM using their dataset (2249 ‘Disease’ nsSNPs and 1189 ‘Polymorphism’ nsSNPs). The experimental results show that the proposed MSKFSVM outperforms the traditional SVM method.

The rest of the paper is organized as follows. Section 2 introduces the multi-scale RBF kernel and the evolution strategy. Section 3 briefly reviews the fuzzy support vector machine. Section 4 describes the proposed membership function in detail. The experimental results are presented in Section 5, and the conclusions are drawn in Section 6.

Section snippets

Multi-scale RBF kernel

All kernels in the literature are either dot product functions K(x, y) = K(x · y) or distance functions K(x, y) = K(‖x − y‖) (Ayat, 2001). Examples of distance functions are the exponential RBF, Gaussian RBF and multi-quadratic kernels; examples of dot product functions are the polynomial, sigmoid and linear kernels. Among these kernel functions, the Gaussian RBF is the most frequently used. However, it has only one parameter for adjusting the width of the RBF, which is not powerful enough for some…
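A multi-scale RBF kernel combines Gaussian RBFs of several widths into a single valid kernel (a convex combination of Mercer kernels is itself a Mercer kernel). The sketch below illustrates the idea; the specific `gammas` and `weights` values are illustrative placeholders, since in the paper these parameters are selected by an evolution strategy.

```python
import numpy as np

def multi_scale_rbf(X, Y, gammas=(0.1, 1.0, 10.0), weights=(0.5, 0.3, 0.2)):
    """Weighted sum of Gaussian RBF kernels at several widths:

        K(x, y) = sum_k w_k * exp(-gamma_k * ||x - y||^2)

    X: (m, d) array, Y: (p, d) array. Returns the (m, p) kernel matrix.
    """
    # Pairwise squared Euclidean distances between rows of X and rows of Y.
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    K = np.zeros((X.shape[0], Y.shape[0]))
    for w, g in zip(weights, gammas):
        K += w * np.exp(-g * sq_dists)
    return K
```

A callable like this can be passed directly as the `kernel` argument of `sklearn.svm.SVC`, which evaluates it on pairs of input matrices during training and prediction.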

Support vector machine

SVM uses hypothesis space of linear functions in a high-dimensional feature space, and it is trained with a learning algorithm based on optimization theory (Cristianini and Shawe-Taylor, 2000).

Suppose a training set S contains n labeled points (x1, y1), …, (xn, yn), where xi ∈ R^N and yi ∈ {−1, 1}, i = 1, …, n. Let Φ(x) denote the mapping from R^N to a feature space Z. We want to find the hyperplane with the maximum margin, w · z + b = 0, such that for each point (zi, yi), where zi = Φ(xi), yi(w · zi + b) ≥ 1, i = 1, …, n. When the…
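The later snippets refer to slack variables ξi in Eq. (6) and memberships si in Eq. (8). For reference, these presumably correspond to the standard soft-margin primal and its fuzzy-SVM variant, sketched here from the definitions above:

```latex
% Soft-margin SVM primal (slack variables \xi_i relax the margin constraints):
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i(w\cdot z_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.

% Fuzzy SVM: each slack is weighted by a membership s_i \in (0, 1],
% so low-membership examples contribute less to the objective:
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n} s_i\,\xi_i
\quad \text{s.t.}\quad y_i(w\cdot z_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0.
```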

The proposed membership function

In order to employ FSVM, we need to define a membership for each input sample. Here we propose a novel membership function. From Eq. (6) we can see that if we increase the ξi of a misclassified sample xi, the newly learned hyperplane will tend to classify xi correctly, in order to eliminate the larger error that xi introduces to the classifier and thereby minimize Eq. (6). Correspondingly, in Eq. (8), if a larger membership si is assigned to an input sample, it will increase the…
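One practical role of such memberships, mentioned in the conclusions, is to counteract class-size imbalance (e.g. 3511 disease-related vs. 502 neutral nsSNPs in DS1). The sketch below shows a simple class-balance membership, an illustrative stand-in rather than the paper's actual definition:

```python
import numpy as np

def class_balance_memberships(y):
    """Assign each sample a membership in (0, 1] that counteracts class
    imbalance: samples from rarer classes receive larger memberships.

    NOTE: this is a hypothetical, illustrative membership function,
    not the one proposed in the paper.
    """
    y = np.asarray(y)
    n = len(y)
    s = np.empty(n, dtype=float)
    for label in np.unique(y):
        mask = (y == label)
        # The larger the class, the smaller its samples' memberships.
        s[mask] = 1.0 - mask.sum() / (2.0 * n)
    return s
```

In practice, memberships like these can be passed as `sample_weight` to `sklearn.svm.SVC.fit`, which scales the per-sample penalty C exactly as the si ξi term does in the FSVM objective.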

Evaluation methods

We use the following two measures to evaluate classification performance. Overall accuracy is defined as

overall accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

The Matthews correlation coefficient (MCC) is defined as

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC has been widely used to measure the performance of machine-learning methods in bioinformatics.
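The two measures above translate directly into code. A minimal sketch from the confusion-matrix counts, using the common convention of returning 0 when the MCC denominator is zero:

```python
import math

def overall_accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1]; unlike accuracy, it
    stays informative on imbalanced datasets such as DS1/DS2.
    Returns 0.0 when any marginal count is zero (degenerate denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

For example, a perfect classifier on the DS2 test set (175 true positives, 30 true negatives) yields an overall accuracy of 1.0 and an MCC of 1.0.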

Conclusions

In this paper, we have proposed a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM). The MSKFSVM uses a multi-scale RBF kernel, which has greater non-linear mapping ability and is more adaptive than a single RBF kernel. The MSKFSVM also uses a fuzzy membership function to eliminate the effect of the size imbalance between positive and negative samples in the datasets. Meanwhile, the membership function ensures that each example makes a different contribution to the learning of the decision surface.

References (26)

  • R. Nair et al.

    Mimicking cellular sorting improves prediction of sub-cellular localization

    J. Mol. Biol.

    (2005)
  • C.T. Saunders et al.

    Evaluation of structural and evolutionary contributions to deleterious mutation prediction

    J. Mol. Biol.

    (2002)
  • N.E. Ayat

    KMOD-A new support vector machine kernel with moderate decreasing for pattern recognition

    Proc. Doc. Anal. Recognition

    (2001)
  • L. Bao et al.

    Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information

    Bioinformatics

    (2005)
  • J.R. Bradford et al.

    Improved prediction of protein–protein binding sites using a support vector machines approach

    Bioinformatics

    (2005)
  • M.P.S. Brown

    Knowledge-based analysis of microarray gene expression data by using support vector machines

    Proc. Natl. Acad. Sci. USA

    (2000)
  • S. Busuttil

    Support vector machines with profile-based kernels for remote protein homology detection

    Genome Inform. Ser. Workshop Genome Inform.

    (2004)
  • E. Byvatov et al.

    Support vector machine applications in bioinformatics

    Appl. Bioinform.

    (2003)
  • Y.C. Collins

    A DNA polymorphism discovery resource for research on human genetic variation

    Genome Res.

    (1998)
  • N. Cristianini et al.

    An Introduction to Support Vector Machines

    (2000)
  • J.A. Cruz et al.

    Applications of machine learning in cancer prediction and prognosis

    Cancer Inform.

    (2006)
  • H. Kim et al.

    Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 2D local descriptor

    Proteins

    (2004)
  • V.G. Krishnan et al.

    A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function

    Bioinformatics

    (2003)