Discrimination of disease-related non-synonymous single nucleotide polymorphisms using multi-scale RBF kernel fuzzy support vector machine
Introduction
Support vector machine (SVM) is a comparatively new learning system developed by Cortes and Vapnik (1995). Since SVM has superior features such as avoiding over-fitting and finding a global optimum, it has been applied to many problems in bioinformatics.
Nguyen and Rajapakse (2005) used SVM to predict protein secondary structures. Kim and Park (2004) applied it to predict protein relative solvent accessibility. SVM has also been applied to remote protein homology detection (Busuttil, 2004), protein–protein binding site prediction (Bradford and Westhead, 2005), protein domain identification (Vlahovicek, 2005), protein subcellular localization (Nair and Rost, 2005), and gene and tissue classification from microarray expression data (Brown, 2000). The applications of SVM in bioinformatics have been reviewed by Byvatov and Schneider (2003) and Yang (2004).
In humans, about 90% of sequence variants are due to differences in single bases of DNA, called single nucleotide polymorphisms (SNPs) (Collins, 1998). Non-synonymous SNPs (nsSNPs), which lead to an amino acid change in the protein product, are closely associated with human inherited diseases (Stenson, 2003). Given the large number of nsSNPs discovered, a major challenge is to predict which of them are potentially disease-associated. Recent studies on nsSNPs have proposed a variety of predictors for discriminating deleterious nsSNPs from neutral ones. Wang and Moult (2001) analyzed the structural properties of deleterious nsSNPs and summarized a set of empirical rules for detecting them. Ng and Henikoff (2001) developed the SIFT (Sorting Intolerant From Tolerant) method for predicting deleterious nsSNPs based on sequence conservation and position-specific scoring matrices. Sunyaev (2001) developed a method for predicting the phenotypic effects of nsSNPs using both structural and evolutionary information.
In contrast to the above empirical rule-based classifications, more and more researchers have begun to use machine-learning methods to automatically train classifiers for the identification of deleterious nsSNPs. Saunders and Baker (2002) used a classification tree to classify the data. Krishnan and Westhead (2003) applied SVM and decision tree algorithms to train predictors of the effects of nsSNPs. Bao and Cui (2005) trained and evaluated classifiers using SVM and random forest algorithms. Ye et al. (2007) found new structural and sequence attributes to predict possible disease association of nsSNPs; coupled with SVM, these attributes greatly improved the classification accuracy. Machine-learning methods applied to bioinformatics have been reviewed by Larrañaga et al. (2006) and Cruz and Wishart (2006).
In this paper, we develop a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM) for discriminating disease-associated nsSNPs from neutral nsSNPs. The MSKFSVM is innovative in two respects. First, it uses a multi-scale RBF kernel. Compared with a single RBF kernel, multi-scale RBF kernels are more adaptive and more powerful for complex classification problems. The parameters of the multi-scale RBF kernel are selected using an evolution strategy. Second, the MSKFSVM uses a novel membership function to assign different weights to input examples when learning the decision surface, whereas the traditional SVM method treats all input examples with equal weight. We evaluated the MSKFSVM comprehensively using the datasets of Bao and Cui (2005), in which all 4218 nsSNPs are split into two datasets according to the number of homologous sequences: the 4013 nsSNPs with at least 10 homologous sequences form the training dataset DS1 (502 neutral and 3511 disease-related nsSNPs), while the remaining 205 nsSNPs form the test dataset DS2 (30 neutral and 175 disease-related nsSNPs). Most recently, Ye et al. (2007) improved the prediction accuracy by using new structural and sequence attributes. To further demonstrate the effectiveness of our method, we also tested the MSKFSVM on their dataset (2249 ‘Disease’ nsSNPs and 1189 ‘Polymorphism’ nsSNPs). The experimental results show that the proposed MSKFSVM outperforms the traditional SVM method.
The rest of the paper is organized as follows. Section 2 introduces the multi-scale RBF kernel and the evolution strategy. Section 3 briefly reviews the fuzzy support vector machine. Section 4 describes the proposed membership function in detail. The experimental results are presented in Section 5, and the conclusions are drawn in Section 6.
Section snippets
Multi-scale RBF kernel
All kernels in the literature are either dot product functions K(x, y) = K(x · y) or distance functions K(x, y) = K(∣∣x − y∣∣) (Ayat, 2001). Examples of distance functions are the exponential RBF, Gaussian RBF and multi-quadratic kernels. Examples of dot product functions are the polynomial, sigmoid and linear kernels. Among these kernel functions, the Gaussian RBF is the most frequently used. However, it has only one parameter for adjusting the width of the RBF, which is not powerful enough for some complex classification problems.
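As a concrete illustration, a multi-scale RBF kernel can be written as a weighted sum of Gaussian RBF kernels at several widths. The sketch below assumes this weighted-sum form; the weights and widths shown are placeholder values (in the paper they are selected with an evolution strategy):

```python
import numpy as np

def multi_scale_rbf(x, y, weights=(0.7, 0.3), widths=(0.5, 2.0)):
    """Multi-scale RBF kernel: a weighted sum of Gaussian RBF kernels.

    k(x, y) = sum_i a_i * exp(-||x - y||^2 / (2 * sigma_i^2))

    The weights a_i and widths sigma_i used here are hypothetical
    defaults, not the values selected in the paper.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    d2 = np.sum((x - y) ** 2)
    return sum(a * np.exp(-d2 / (2.0 * s ** 2))
               for a, s in zip(weights, widths))
```

When the weights sum to 1, the kernel keeps k(x, x) = 1 like a single Gaussian RBF, while the mixture of a narrow and a wide scale lets it fit both sharp and smooth regions of the decision boundary.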
Support vector machine
SVM uses a hypothesis space of linear functions in a high-dimensional feature space, and it is trained with a learning algorithm based on optimization theory (Cristianini and Shawe-Taylor, 2000).
Suppose a training set S contains n labeled points (x1, y1), … , (xn, yn), where xi ∈ RN and yi ∈ {−1, 1}, i = 1, … , n. Φ(x) denotes the mapping from RN to a feature space Z. We want to find the hyperplane with the maximum margin:

min(w, b) (1/2)∣∣w∣∣²

such that for each point (zi, yi), where zi = Φ(xi),

yi(w · zi + b) ≥ 1, i = 1, … , n.

When the training data are not linearly separable, slack variables ξi ≥ 0 are introduced to relax these constraints.
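The membership-weighted objective referred to in the next section (Eq. (8) of the paper, which this snippet truncates) has, in the standard fuzzy SVM formulation of Lin and Wang, the following form; this is a hedged reconstruction from the FSVM literature, not the paper's exact equation:

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} s_i \,\xi_i
\qquad \text{s.t.} \quad y_i \left( w \cdot z_i + b \right) \ge 1 - \xi_i,
\quad \xi_i \ge 0, \; i = 1, \dots, n
```

Here si ∈ (0, 1] is the membership of sample xi: a small si shrinks the penalty C·si·ξi for misclassifying xi, so less-trusted or over-represented samples influence the decision surface less.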
The proposed membership function
In order to employ FSVM, we need to define a membership for each input sample. Here we propose a novel membership function. From Eq. (6) we can see that if we increase the weight of a misclassified sample xi, the newly learned hyperplane will tend to classify it correctly, so as to eliminate the larger error it introduces to the classifier and thereby minimize Eq. (6). Correspondingly, in Eq. (8), assigning a larger membership si to an input sample increases the penalty incurred when that sample is misclassified.
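The paper's exact membership definition is truncated in this snippet. As an illustrative (hypothetical) assignment in the same spirit, the conclusions note that the membership function counteracts the size imbalance between the positive and negative classes; a minimal class-balancing membership might look like:

```python
import numpy as np

def class_balance_memberships(y):
    """Illustrative membership assignment (NOT the paper's actual
    function): each sample gets a membership inversely proportional
    to its class size, normalized into (0, 1], so the minority class
    is not swamped by the majority class in the FSVM objective."""
    y = np.asarray(y)
    n = len(y)
    s = np.empty(n, dtype=float)
    for label in np.unique(y):
        mask = (y == label)
        s[mask] = n / (2.0 * mask.sum())  # inverse class frequency
    return s / s.max()  # scale so the largest membership is 1
```

With a 3:1 imbalance, for example, minority-class samples receive membership 1.0 and majority-class samples 1/3, so the two classes contribute equally to the total penalty term.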
Evaluation methods
We use the following two measures to evaluate the classification performance. Overall accuracy is defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
The Matthews correlation coefficient (MCC) is defined as

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

MCC has been widely used in measuring the performance of machine-learning methods in bioinformatics.
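Both measures can be computed directly from the confusion-matrix counts:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Overall accuracy: fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient in [-1, 1]; returns 0 when
    any marginal sum is zero (the usual convention for the 0/0 case)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, MCC stays informative on imbalanced datasets such as DS1 (502 neutral vs. 3511 disease-related nsSNPs): a classifier that always predicts the majority class scores high accuracy but an MCC of 0.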
Conclusions
In this paper, we have proposed a multi-scale RBF kernel fuzzy support vector machine (MSKFSVM). MSKFSVM uses a multi-scale RBF kernel, which has greater non-linear mapping ability and is more adaptive than a single RBF kernel. MSKFSVM also uses a fuzzy membership function to eliminate the effect of the size imbalance between positive and negative samples in the datasets. Meanwhile, the membership function ensures that different examples make different contributions to the learning of the decision surface.
References (26)
- Nair and Rost (2005). Mimicking cellular sorting improves prediction of sub-cellular localization. J. Mol. Biol.
- Saunders and Baker (2002). Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol.
- Ayat et al. (2001). KMOD: a new support vector machine kernel with moderate decreasing for pattern recognition. Proc. Doc. Anal. Recognition.
- Bao and Cui (2005). Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics.
- Bradford and Westhead (2005). Improved prediction of protein–protein binding sites using a support vector machines approach. Bioinformatics.
- Brown et al. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA.
- Busuttil et al. (2004). Support vector machines with profile-based kernels for remote protein homology detection. Genome Inform. Ser. Workshop Genome Inform.
- Byvatov and Schneider (2003). Support vector machine applications in bioinformatics. Appl. Bioinform.
- Collins et al. (1998). A DNA polymorphism discovery resource for research on human genetic variation. Genome Res.
- Cristianini and Shawe-Taylor (2000). An Introduction to Support Vector Machines. Cambridge University Press.
- Cruz and Wishart (2006). Applications of machine learning in cancer prediction and prognosis. Cancer Inform.
- Kim and Park (2004). Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 2D local descriptor. Proteins.
- Krishnan and Westhead (2003). A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics.