Prediction of disulfide bonding pattern based on a support vector machine and multiple trajectory search
Introduction
Disulfide bonds contribute significantly to the stabilization of protein conformations. In protein folding prediction, accurate prediction of disulfide bridges can substantially reduce the conformational search space [16], [28]. Predicting the disulfide bonding pattern also helps, to a certain extent, to determine the three-dimensional structure of a protein and its function, because disulfide bonds impose geometrical constraints on the protein backbone. Recent works have demonstrated that disulfide bonding patterns and protein structures are closely related [9], [33].
Predicting disulfide bonds requires addressing two problems. First, the disulfide bonding states must be determined: a cysteine may be in an oxidized or reduced state, and one must determine whether it is oxidized. Second, the disulfide bonding pattern must be predicted, i.e. which pairs of oxidized cysteines form a disulfide bond. Significant progress has been made recently in predicting disulfide bonding states. Several methods have been developed based on statistical analysis [14], neural networks [12], [22], and support vector machines [8], which predict the bonding state of cysteines with accuracy rates ranging from 81% to 90%. Moreover, several methods have been devised to predict disulfide bonding patterns. Fariselli and Casadio [10] pioneered the first method by reducing disulfide connectivity prediction to a graph matching problem in which vertices represent oxidized cysteines and edges are labeled with the interaction strength, i.e. contact potential, of the associated pair of cysteines. Monte Carlo simulated annealing was adopted to derive the optimal values of the contact potentials, and the disulfide bridges were then derived by finding the maximum weight perfect matching. Fariselli et al. [11] improved upon their previous findings by using neural networks to estimate cysteine pairwise interactions. Vullo and Frasconi [34] constructed an ad hoc recursive neural network for scoring labeled undirected graphs that represent connectivity patterns, significantly increasing the prediction accuracy rate for bonding patterns from 34% to 44%. Baldi et al. [3] and Ferrè and Clote [13] improved the prediction accuracy by using two-dimensional recursive neural networks and a diresidue neural network, respectively, to predict connectivity probabilities between cysteine pairs.
Ferrè and Clote [13] also trained their predictive model on secondary structure information and diresidue frequencies. Tsai et al. [31] predicted connectivity probabilities between cysteine pairs by using the support vector machine, with local sequence profiles of a protein and the linear distances between cysteines in the protein as training features. The above methods are all based on reducing the connectivity pattern prediction problem to the maximum weight perfect matching problem. In contrast, the following four methods are not based on this reduction. Chen and Hwang [7] determined the bonding pattern directly by using the support vector machine; the coupling between any two cysteines located in individual local sequences (the local evolutionary information derived from the position-specific scoring matrix), the cysteine spacing patterns, and the amino acid content were the features they used for training. Zhao et al. [36] assumed that two proteins with similar cysteine separations share the same disulfide connectivity pattern, and therefore used a simple feature called the cysteine separations profile (CSP) to predict connectivity patterns. Chen et al. [6] developed a two-level model in which protein local information was used as a pair-wise feature at the first level; inputs to the second level were the first-level scores and global information such as the cysteine ordering, protein length, and cysteine separations profiles. Lu et al. [21] achieved a prediction accuracy of 73.9% by using a genetic algorithm to optimize the feature selection for the SVM. Song et al. [29] obtained a prediction accuracy of 74.4%, the best previously reported in the literature, by using multiple sequence vectors and secondary structure. Rubinstein and Fiser [26] recently analyzed correlated mutation patterns in multiple sequence alignments to predict disulfide bond connectivity.
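The reduction to maximum weight perfect matching can be illustrated with a small sketch. The pairwise scores below are made-up numbers, and exhaustive enumeration stands in for the dedicated matching algorithms used in the cited works:

```python
# Hypothetical interaction scores for four oxidized cysteines (0..3);
# higher values mean a disulfide bridge between the pair is more likely.
scores = {frozenset(p): w for p, w in
          {(0, 1): 0.2, (0, 2): 0.9, (0, 3): 0.3,
           (1, 2): 0.1, (1, 3): 0.8, (2, 3): 0.4}.items()}

def perfect_matchings(nodes):
    """Enumerate all perfect matchings of an even-sized node list."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for k, partner in enumerate(rest):
        for m in perfect_matchings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + m

def best_pattern(nodes, scores):
    """Maximum weight perfect matching by exhaustive enumeration;
    adequate for the small numbers of bonded cysteines found in practice."""
    return max(perfect_matchings(nodes),
               key=lambda m: sum(scores[frozenset(e)] for e in m))

print(best_pattern([0, 1, 2, 3], scores))  # [(0, 2), (1, 3)]
```

Here the pairing (0, 2), (1, 3) wins because its total score (0.9 + 0.8) exceeds that of the two alternative perfect matchings.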
This work adjusts the parameters of the SVM and the window sizes of the features by using a novel evolutionary algorithm called the multiple trajectory search (MTS), and integrates it with the SVM training. The proposed algorithm achieves an accuracy rate of 79.8% when tested on the same SP39 dataset.
As a supervised learning method for classification [5], the support vector machine (SVM) is a highly effective means of solving problems in nonlinear classification, function estimation, and density estimation. It has led to many applications, including image interpretation [23], image classification [35], intrusion detection in information security [27], pattern classification [18], data mining, and other biotechnological fields [4], [17], [20]. In particular, applications of the SVM in medicine and biology have grown rapidly, such as in the prediction of RNA-binding sites [30] and secondary structures [25] of proteins.
A set of training samples is given in the training of an SVM model. A training sample can be expressed as (Xi, yi), with Xi = (x1, x2, … , xn) and yi = 1 if Xi is in one class and yi = −1 if it is in the other. Training the SVM involves solving the following optimization problem:

min(w, b, ξ) (1/2)‖w‖² + C Σi ξi
subject to yi(wᵀϕ(Xi) + b) ≥ 1 − ξi and ξi ≥ 0 for all i,

where Xi is mapped to a higher-dimensional space by the function ϕ; ξi denotes the allowed error; and C represents the penalty of the error. In practice, the kernel function can be expressed as K(Xi, Xj) = ϕ(Xi)ᵀϕ(Xj).
Several kernel functions can be used to construct SVM models. In this work, the radial basis function (RBF) kernel is used, which is given by K(Xi, Xj) = exp(−γ‖Xi − Xj‖²) with the parameter γ > 0.
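In code, the RBF kernel is a one-liner (a minimal sketch; Xi and Xj are plain feature lists here):

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """K(Xi, Xj) = exp(-gamma * ||Xi - Xj||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # 1.0: identical inputs
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))  # exp(-1) ≈ 0.368
```

Note that the kernel value decays from 1 toward 0 as the two feature vectors move apart, at a rate controlled by γ; this is one of the two quantities the MTS tunes below.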
Devised for single-objective real-parameter optimization [32], the multiple trajectory search (MTS) uses multiple agents to search the solution space concurrently. Each agent conducts an iterated local search using one of three candidate local search methods. By selecting the local search method that best fits the landscape of a solution's neighborhood, an agent may reach either a local optimum or the global optimum. Notably, the MTS can effectively solve single-objective real-parameter optimization problems.
The MTS is briefly described as follows; further details can be found in [32]. The MTS initially generates M solutions based on the simulated orthogonal array (SOA), a modified version of the orthogonal array (OA). The SOA tends to distribute these M initial solutions uniformly over the feasible solution space. The initial search range for the local search methods is set to half the difference between the upper bound and the lower bound along each dimension. Afterwards, the local search methods constrict the search range step by step.
The MTS consists of iterations of local searches until the termination condition is satisfied. During the first iteration, the MTS conducts local searches on all M initial solutions. In the following iterations, only a fixed number of best solutions are selected as foreground solutions, and the MTS conducts local searches only on these. Three local search methods are devised in the MTS. The MTS first evaluates the performance of the three local search methods and then selects the one that performs best, i.e. the one that best fits the landscape of the neighborhood of the solution at hand, to perform the search. After searching the foreground solutions, the MTS applies Local Search 1 to the current best solution in an attempt to improve it further. Before completing the iteration, a fixed number of best solutions are selected as the foreground solutions for the next iteration.
In the MTS, three local search methods are used to search different landscapes of the neighborhood of a solution. Local Search 1 searches along one dimension at a time, from the first dimension to the last. Local Search 2 resembles Local Search 1 except that it searches only about one-fourth of the dimensions. In both methods, the search range (SR) is constricted to 0.5 × SR whenever the previous local search fails to improve the current solution, until SR falls below 1 × 10−4. In Local Search 1, for the dimension under consideration, SR is first subtracted from the coordinate of the solution in that dimension to determine whether the objective function value improves. If it improves, the search moves on to the next dimension. If not, the solution is restored and 0.5 × SR is added to the coordinate in that dimension, again to determine whether the objective function value improves. If it improves, the search moves on to the next dimension; if not, the solution is restored and the search proceeds to the next dimension.
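A simplified, single-solution sketch of Local Search 1 as described above, minimizing a toy objective (the full method in [32] includes bookkeeping omitted here):

```python
def local_search_1(solution, f, search_range, min_sr=1e-4):
    """Coordinate-wise search: along each dimension, first subtract SR
    from the coordinate; if that does not improve f, restore and add
    0.5 * SR instead. SR is halved whenever a full pass brings no
    improvement, until it falls below min_sr (simplified sketch)."""
    best = f(solution)
    sr = search_range
    while sr >= min_sr:
        improved = False
        for d in range(len(solution)):
            for step in (-sr, 0.5 * sr):
                old = solution[d]
                solution[d] = old + step
                value = f(solution)
                if value < best:        # minimization: keep the move
                    best = value
                    improved = True
                    break               # go on to the next dimension
                solution[d] = old       # restore and try the other step
        if not improved:
            sr *= 0.5                   # constrict the search range
    return solution, best

# Toy objective: minimize the sum of squares from a given start point.
sol, val = local_search_1([2.0, -3.0], lambda x: sum(v * v for v in x), 1.0)
print(val < 1e-6)  # True
```

The geometric shrinking of SR is what lets the search first take coarse steps across the landscape and then refine the solution to high precision.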
Local Search 3 differs from Local Searches 1 and 2. It considers three small movements along each dimension and heuristically determines the movement of the solution along each dimension. Although the search proceeds along each dimension from the first to the last, the objective function value is evaluated only after all dimensions have been searched, and the solution is moved to its new position only if the objective function value improves in this evaluation.
Highly effective at real-parameter optimization, the MTS won a competition held at the 2008 IEEE Congress on Evolutionary Computation. In this work, it is used to adjust the SVM parameters and the window sizes for various features. Since the parameters and the window sizes are real values, the MTS is highly appropriate for this application.
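The tuning loop can be sketched as follows. This is an illustrative, simplified MTS-style search, not the paper's exact algorithm: cv_accuracy is a smooth surrogate standing in for the fourfold cross-validation run in the actual method, its peak at (1, −2) is an arbitrary choice, and the feature window sizes are omitted:

```python
import random

def cv_accuracy(log_c, log_gamma):
    """Stand-in objective. In the real method this would train the SVM
    with C = 10**log_c and gamma = 10**log_gamma and return the
    cross-validation accuracy; a smooth surrogate peaking at
    (log_c, log_gamma) = (1, -2) is used here purely for illustration."""
    return 1.0 - 0.01 * ((log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2)

def mts_like_tune(n_agents=5, max_iters=500, seed=0):
    rng = random.Random(seed)
    # Spread the agents over the search space (the SOA plays this
    # role in the actual MTS).
    agents = [[rng.uniform(-3, 3), rng.uniform(-5, 1)] for _ in range(n_agents)]
    sr = 3.0  # initial search range: half the width of the bounds
    for _ in range(max_iters):
        improved = False
        for a in agents:
            for d in (0, 1):
                for step in (-sr, 0.5 * sr):
                    trial = a[:]
                    trial[d] += step
                    if cv_accuracy(*trial) > cv_accuracy(*a):
                        a[:] = trial
                        improved = True
                        break
        if not improved:
            sr *= 0.5          # constrict the search range
        if sr < 1e-4:
            break
    return max(agents, key=lambda a: cv_accuracy(*a))

log_c, log_gamma = mts_like_tune()
print(round(log_c, 2), round(log_gamma, 2))
```

Searching over log-scaled C and γ is a common convention for SVM hyperparameters, since useful values span several orders of magnitude.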
In addition to providing the definitions of the orthogonal array (OA) and the simulated orthogonal array (SOA), the Appendix lists the pseudo codes of the MTS and three local search methods.
Datasets
Performance of the proposed method was compared with those of previous works [3], [6], [7], [10], [11], [13], [21], [26], [29], [31], [34], in which the same dataset SP39 was used. The SP39 dataset [34] was adopted from the SWISS-PROT database, release No. 39 [2], by applying the same filtering procedure used by Fariselli and Casadio [10] to include only high quality sequences with experimentally verified intra-chain disulfide bridge annotations. In the dataset, fewer than 20% of the sequences
Results and discussion
Performance of the prediction algorithm is evaluated using two accuracy indices, Qp and Qc:

Qp = Cp / Tp,  Qc = Cc / Tc,

where Cp represents the number of proteins whose bonding patterns are correctly predicted; Tp represents the total number of proteins in the test set; Cc represents the number of disulfide bridges that are correctly predicted; and Tc represents the total number of disulfide bridges in the test proteins.
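In code, the two indices reduce to simple ratios. The per-protein results below are hypothetical, with each bridge written as a pair of cysteine indices:

```python
def accuracy_indices(results):
    """results: list of (predicted_bridges, true_bridges) per test protein,
    each given as a set of cysteine index pairs."""
    cp = sum(pred == true for pred, true in results)      # fully correct proteins
    tp = len(results)
    cc = sum(len(pred & true) for pred, true in results)  # correct bridges
    tc = sum(len(true) for pred, true in results)
    return cp / tp, cc / tc  # Qp, Qc

# Hypothetical predictions for two proteins with two bridges each.
results = [({(1, 3), (2, 4)}, {(1, 3), (2, 4)}),   # both bridges correct
           ({(1, 2), (3, 4)}, {(1, 4), (3, 4)})]   # one of two correct
qp, qc = accuracy_indices(results)
print(qp, qc)  # 0.5 0.75
```

Note that Qp is the stricter index: a protein counts toward Cp only when its entire bonding pattern is reproduced, while Qc gives partial credit per bridge.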
Table 2 summarizes the accuracy rate of the proposed method in comparison with those of
Conclusions
Given the importance of feature selection and/or extraction in pattern classification, this study predicted cysteine connectivity patterns based on four features. Experimental results indicate that the proposed method achieves an accuracy of 79.8% under fourfold cross-validation on the SP39 dataset, which is better than the prediction performance reported in previous studies. Our results also indicate that the proposed method achieves prediction accuracies as high as 70.6% and
Acknowledgements
The authors thank National Science Council of the Republic of China, Taiwan (Contract No. NSC98-2221-E005-049-MY3) and Central Taiwan University of Science and Technology (Contract No. CTU99-P-33) for partially supporting this research work. The anonymous reviewers are commended for their valuable comments.
References (36)
- et al., Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions, Journal of Molecular Biology (1999)
- et al., Support vector machines with genetic fuzzy feature transformation for biomedical data classification, Information Sciences (2007)
- Protein secondary structure prediction based on position-specific scoring matrices, Journal of Molecular Biology (1999)
- et al., Mammographic masses characterization based on localized texture and dataset fractal analysis using linear, neural and support vector machine classifiers, Artificial Intelligence in Medicine (2006)
- et al., A hybrid machine learning approach to network anomaly detection, Information Sciences (2007)
- et al., MONSSTER: a method for folding globular proteins with a small number of distance restraints, Journal of Molecular Biology (1997)
- et al., RISP: a web-based server for prediction of RNA-binding sites in proteins, Computer Methods and Programs in Biomedicine (2008)
- et al., A novel database of disulfide patterns and its application to the discovery of distantly related homologs, Journal of Molecular Biology (2004)
- et al., Shape-based image retrieval using support vector machines, Fourier descriptors and self-organizing maps, Information Sciences (2007)
- et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research (1997)
- The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000, Nucleic Acids Research
- Large-scale prediction of disulphide bond connectivity
- Knowledge-based analysis of microarray gene expression data using support vector machines, PNAS
- Disulfide connectivity prediction with 70% accuracy using two-level models, Proteins
- Prediction of disulfide connectivity from protein sequences, Proteins
- Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences, Proteins
- Relationship between protein structures and disulfide-bonding patterns, Proteins