Elsevier

Information Sciences

Volume 199, 15 September 2012, Pages 167-178
Prediction of disulfide bonding pattern based on a support vector machine and multiple trajectory search

https://doi.org/10.1016/j.ins.2012.02.035

Abstract

In determining protein folding, accurately predicting the connectivity pattern of disulfide bridges can significantly reduce the search space, which helps to solve the protein-folding problem. Therefore, an effective means of predicting disulfide connectivity patterns facilitates the estimation of the three-dimensional structure of a protein and its function. To our knowledge, given prior knowledge of the bonding states of cysteines, the highest accuracy rate in the literature for predicting the overall disulfide connectivity pattern (Qp) is 74.4% for dataset SP39, which is conventionally adopted for evaluating disulfide connectivity prediction. This work presents a novel classifier based on the support vector machine (SVM) that incorporates features of the position-specific scoring matrix (PSSM), normalized bond lengths, the predicted secondary structure of the protein, and indices for the physicochemical properties of amino acids. The support vector machine is trained to derive the connectivity probabilities of cysteine pairs. Additionally, an evolutionary algorithm called the multiple trajectory search (MTS) is integrated with the SVM model to tune the SVM parameters and the window sizes for the above features. The disulfide connectivity pattern is then identified by using the maximum weight perfect matching algorithm. Experimental results indicate that the accuracy rate for predicting the overall disulfide connectivity pattern (Qp) reaches 79.8% when tested using the same dataset SP39.

Introduction

Disulfide bonds significantly affect the stabilization of protein conformations. In predicting protein folding, accurate prediction of disulfide bridges can significantly reduce the search space [16], [28]. Predicting the disulfide bonding pattern helps, to a certain extent, to determine the three-dimensional structure of a protein and its function because disulfide bonds impose geometrical constraints on the protein backbone. Recent works have demonstrated that disulfide bonding patterns and protein structures are closely related to each other [9], [33].

Predicting disulfide bonds requires addressing two problems. First, the disulfide bonding states must be determined, i.e. a cysteine may be in an oxidized or reduced state and one must determine whether or not a cysteine is oxidized. Second, the disulfide bonding pattern must be predicted, with the goal of determining whether individual pairs of cysteines form a disulfide bond. Significant progress has been made recently in predicting disulfide bonding states. Several methods have been developed based on statistical analysis [14], neural networks [12], [22], and support vector machines [8], which can predict the bonding state of cysteines with accuracy rates ranging from 81% to 90%. Moreover, several methods have been devised to predict disulfide bonding patterns. Fariselli and Casadio [10] pioneered the first method by reducing disulfide connectivity to a graph matching problem in which vertices represent oxidized cysteines and edges are labeled according to the interaction strength, i.e. contact potential, of the associated pair of cysteines. The Monte Carlo simulated annealing method was adopted to derive the optimal values of the contact potentials. Finally, the disulfide bridges were derived by finding the maximum weight perfect matching. Fariselli et al. [11] improved upon their previous findings by using neural networks to estimate cysteine pairwise interactions. Vullo and Frasconi [34] constructed an ad hoc recursive neural network for scoring labeled undirected graphs that represent connectivity patterns, which significantly increased the prediction accuracy rate of the bonding pattern from 34% to 44%. Baldi et al. [3] and Ferrè and Clote [13] improved the prediction accuracy by using two-dimensional recursive neural networks and the diresidue neural network, respectively, to predict connectivity probabilities between cysteine pairs.
Ferrè and Clote [13] also trained the predictive model based on secondary structure information and diresidue frequencies. Tsai et al. [31] instead predicted connectivity probabilities between cysteine pairs by using the support vector machine. Local sequence profiles of a protein and the linear distance between cysteines in the protein were the features used to train the support vector machine. The above methods are based on reducing the connectivity pattern prediction problem to the maximum weight perfect matching problem. In contrast, the following four methods are not based on this reduction. Chen and Hwang [7] determined the bonding pattern directly by using the support vector machine. The features that they used to train the support vector machine are the coupling between any two cysteines located in individual local sequences (the local evolutionary information derived from the position-specific scoring matrix), the cysteine spacing patterns, and the amino acid content. Zhao et al. [36] assumed that two proteins with similar cysteine separations share the same disulfide connectivity pattern, so they used a simple feature called cysteine separation profiles (CSP) to predict the connectivity patterns. Chen et al. [6] developed a two-level model, in which the protein local information was used as a pair-wise feature in the first level. Inputs in the second level were the scores of the first level and global information such as the cysteine ordering, protein length and cysteine separation profiles. Lu et al. [21] achieved a prediction accuracy of 73.9% by using a genetic algorithm to optimize the feature selection for the SVM. Song et al. [29] obtained a slightly higher prediction accuracy of 74.4% by using multiple sequence vectors and the secondary structure, which is the best result in the literature. Rubinstein and Fiser [26] recently analyzed the correlated mutation patterns in multiple sequence alignments to predict the disulfide bond connectivity.
This work adjusts the parameters of the SVM and the window sizes of the features by using a novel evolutionary algorithm called the multiple trajectory search, which is integrated with the SVM training. The proposed algorithm achieves an accuracy rate of 79.8% when tested on the same dataset SP39.
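The reduction to maximum weight perfect matching described above can be illustrated with a minimal Python sketch. It assumes that a score (for example, a connectivity probability) is already available for every pair of oxidized cysteines, and it simply enumerates all perfect matchings, which is practical only for a handful of cysteines. The function name `best_perfect_matching` and the example scores are hypothetical; the section does not specify which matching implementation the authors used.

```python
def best_perfect_matching(weights):
    """Return (total_weight, pairs) of a maximum weight perfect matching.

    weights[i][j] is the score for pairing cysteine i with cysteine j.
    Exhaustive enumeration: illustrative only, not an efficient algorithm.
    """
    n = len(weights)
    if n % 2 != 0:
        raise ValueError("a perfect matching needs an even number of cysteines")

    def search(unmatched):
        # No cysteines left: the empty matching has weight zero.
        if not unmatched:
            return 0.0, []
        first, rest = unmatched[0], unmatched[1:]
        best_total, best_pairs = float("-inf"), []
        # Pair the first unmatched cysteine with every possible partner.
        for k, partner in enumerate(rest):
            remaining = rest[:k] + rest[k + 1:]
            sub_total, sub_pairs = search(remaining)
            total = weights[first][partner] + sub_total
            if total > best_total:
                best_total, best_pairs = total, [(first, partner)] + sub_pairs
        return best_total, best_pairs

    return search(tuple(range(n)))


# Hypothetical pairwise scores for four oxidized cysteines.
scores = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.3, 0.4],
    [0.2, 0.3, 0.0, 0.8],
    [0.1, 0.4, 0.8, 0.0],
]
total, pairs = best_perfect_matching(scores)
print(pairs)  # [(0, 1), (2, 3)]
```

With four cysteines there are only three candidate patterns, so enumeration is trivial; the number of patterns grows quickly with the number of bridges, which is why a polynomial-time matching algorithm is normally preferred for larger cases.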

As a supervised learning method for classification [5], the support vector machine (SVM) is a highly effective means of solving problems in nonlinear classification, function estimation and density estimation. It has many applications, including image interpretation [23], image classification [35], intrusion detection in information security [27], pattern classification [18], data mining and other biotechnological fields [4], [17], [20]. In particular, applications of the SVM in medicine and biology have grown rapidly, such as the prediction of RNA-binding sites [30] and secondary structures [25] of proteins.

A set of training samples is given in the training of an SVM model. A training sample can be expressed as $(X_i, y_i)$, with $X_i = (x_1, x_2, \ldots, x_n)$ and $y_i = 1$ if $X_i$ is in one class and $y_i = -1$ if $X_i$ is in the other. The training of the SVM involves solving the following optimization problem:

$$\min_{\omega, b, \xi} \; \frac{1}{2}\omega^T\omega + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\left(\omega^T\phi(X_i) + b\right) \ge 1 - \xi_i \ \text{and} \ \xi_i \ge 0,$$

where $X_i$ is mapped to a higher dimensional space by the function $\phi$; $\xi_i$ denotes the allowed error; and $C$ represents the penalty of the error. In practice, the kernel function can be expressed as $K(X_i, X_j) = \phi(X_i)^T\phi(X_j)$.

Several kernel functions are used to construct SVM models. In this work, the radial basis function (RBF) is used, which is given by $K(X_i, X_j) = e^{-\gamma\|X_i - X_j\|^2}$ with the parameter $\gamma > 0$.
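As a small illustration of the RBF kernel above, the following Python sketch evaluates it for two feature vectors using only the standard library. The function name `rbf_kernel` and the sample vectors are hypothetical and are not taken from the paper.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Return exp(-gamma * ||x_i - x_j||^2) for two equal-length vectors."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    # Squared Euclidean distance between the two feature vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# Identical vectors have zero distance, so the kernel value is 1.0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
# Squared distance 2 with gamma 0.5 gives exp(-1).
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

A larger γ makes the kernel value decay faster with distance, which is one reason γ is tuned together with the penalty C.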

Devised for single-objective real-parameter optimization [32], the multiple trajectory search (MTS) uses multiple agents to search the solution space concurrently. Each agent conducts an iterated local search using one of three candidate local search methods. By selecting the local search method that best fits the landscape of a solution's neighborhood, an agent may reach either a local optimum or the global optimum. Notably, the MTS can effectively solve single-objective real-parameter optimization problems.

The MTS is briefly described as follows. Further details can be found in [32]. The MTS initially generates M initial solutions based on the simulated orthogonal array (SOA). As a modified version of the orthogonal array (OA), the SOA tends to distribute these M initial solutions uniformly over the feasible solution space. The initial search range for the local search methods is set to half the difference between the upper bound and the lower bound along each dimension. Afterwards, the local search methods constrict the search range step by step.

The MTS consists of iterations of local searches that continue until the termination condition is satisfied. During the first iteration, the MTS conducts local searches on all M initial solutions. In the following iterations, however, only a fixed number of the best solutions are selected as foreground solutions, and the MTS conducts local searches only on these foreground solutions. Three local search methods are devised in the MTS. The MTS initially evaluates the performance of the three local search methods and then selects the one that performs best, i.e. the one that best fits the landscape of the neighborhood of the solution at hand, to perform the search. After conducting the search on the foreground solutions, the MTS applies Local Search 1 to the current best solution in an attempt to improve it further. Before completion of the iteration, a fixed number of the best solutions are selected as the foreground solutions for the next iteration.

In the MTS, three local search methods are used to search different landscapes of the neighborhood of a solution. Local Search 1 searches along one dimension at a time, from the first dimension to the last one. Local Search 2 resembles Local Search 1 except that it searches only approximately one-fourth of the dimensions. In both local search methods, the search range (SR) is halved to 0.5 × SR whenever the previous local search fails to improve the current solution, until SR falls below 1 × 10⁻⁴. In Local Search 1, for the dimension under consideration, SR is first subtracted from the coordinate of the solution to determine whether the objective function value is improved. If improved, the search then considers the next dimension. If unimproved, the solution is restored and 0.5 × SR is then added to the coordinate of the solution for this dimension, again to determine whether the objective function value is improved. If improved, the search then considers the next dimension. If unimproved, the solution is restored and the search proceeds to the next dimension.
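The description of Local Search 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the objective is minimized, the search range is simply left unchanged once it is below the 1 × 10⁻⁴ threshold (the text above does not say what happens next), and the names `local_search_1` and `sphere` are hypothetical. The authoritative pseudocode is in [32] and the Appendix.

```python
def local_search_1(objective, solution, search_range, improved_last_time,
                   min_range=1e-4):
    """One pass of a Local-Search-1-style coordinate search (minimization).

    Returns (new_solution, new_value, search_range, improved).
    """
    # Halve the search range if the previous pass brought no improvement.
    # Assumption: once the range is below min_range it is left unchanged.
    if not improved_last_time and search_range >= min_range:
        search_range *= 0.5

    current = list(solution)
    best_value = objective(current)
    improved = False
    for d in range(len(current)):
        original = current[d]
        # First try: decrease the coordinate by the full search range.
        current[d] = original - search_range
        value = objective(current)
        if value < best_value:
            best_value, improved = value, True
            continue
        # Second try: restore, then increase by half the search range.
        current[d] = original + 0.5 * search_range
        value = objective(current)
        if value < best_value:
            best_value, improved = value, True
            continue
        # Neither move helped: restore the coordinate.
        current[d] = original
    return current, best_value, search_range, improved


def sphere(x):
    """Hypothetical test objective with its minimum at the origin."""
    return sum(v * v for v in x)


point, sr, improved = [3.0, -2.0], 1.0, True
for _ in range(30):
    point, value, sr, improved = local_search_1(sphere, point, sr, improved)
print(point, value)
```

The asymmetric step sizes (a full step down, a half step up) follow the description above; repeated passes combined with the shrinking range let the search refine a solution coordinate by coordinate.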

Local Search 3 differs from Local Searches 1 and 2. Local Search 3 considers three small movements along each dimension and heuristically determines the movement of the solution along each dimension. In Local Search 3, although the search proceeds along each dimension from the first to the last, the objective function value is evaluated only after all dimensions have been searched. Moreover, the solution is moved to a new position only if the objective function value is improved in this evaluation.

Highly effective at real-parameter optimization, the MTS won a competition held at the 2008 IEEE Congress on Evolutionary Computation. In this work, it is used to adjust the SVM parameters and the window sizes for various features. Since the parameters and the window sizes are real values, the MTS is highly appropriate for this application.

In addition to providing the definitions of the orthogonal array (OA) and the simulated orthogonal array (SOA), the Appendix lists the pseudocode of the MTS and the three local search methods.

Datasets

Performance of the proposed method was compared with those of previous works [3], [6], [7], [10], [11], [13], [21], [26], [29], [31], [34], in which the same dataset SP39 was used. The SP39 dataset [34] was adopted from the SWISS-PROT database, release No. 39 [2], by applying the same filtering procedure used by Fariselli and Casadio [10] to include only high quality sequences with experimentally verified intra-chain disulfide bridge annotations. In the dataset, fewer than 20% of the sequences

Results and discussion

Performance of the prediction algorithm is evaluated using two accuracy indices, $Q_P$ and $Q_C$:

$$Q_P = C_P / T_P \quad \text{and} \quad Q_C = C_C / T_C,$$

where $C_P$ represents the number of proteins whose bonding patterns are correctly predicted; $T_P$ represents the total number of proteins in the test set; $C_C$ represents the number of disulfide bridges that are correctly predicted; and $T_C$ represents the total number of disulfide bridges in the test proteins.
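The two accuracy indices can be computed as in the following Python sketch. It assumes that each bonding pattern is given as a list of cysteine index pairs, one pair per disulfide bridge; the function name `connectivity_accuracy` and the example patterns are hypothetical.

```python
def connectivity_accuracy(predicted, actual):
    """Return (Q_P, Q_C) for parallel lists of predicted and true patterns.

    Each pattern is a collection of cysteine index pairs, one per bridge.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must cover the same proteins")
    correct_proteins = 0   # C_P
    correct_bridges = 0    # C_C
    total_bridges = 0      # T_C
    for pred, true in zip(predicted, actual):
        # Treat each bridge as unordered so (1, 0) equals (0, 1).
        pred_set = {frozenset(pair) for pair in pred}
        true_set = {frozenset(pair) for pair in true}
        if pred_set == true_set:
            correct_proteins += 1
        correct_bridges += len(pred_set & true_set)
        total_bridges += len(true_set)
    return correct_proteins / len(actual), correct_bridges / total_bridges


# Two hypothetical test proteins: the first pattern is fully correct,
# the second has only one of its three bridges predicted correctly.
true_patterns = [[(0, 1), (2, 3)], [(0, 1), (2, 3), (4, 5)]]
pred_patterns = [[(1, 0), (2, 3)], [(0, 1), (2, 4), (3, 5)]]
q_p, q_c = connectivity_accuracy(pred_patterns, true_patterns)
print(q_p, q_c)  # 0.5 0.6
```

Because a pattern counts toward the first index only when every bridge is right, that index is never larger than the fraction of proteins with at least one correct bridge, and it is typically the stricter of the two measures.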

Table 2 summarizes the accuracy rate of the proposed method in comparison with those of

Conclusions

Given the importance of feature selection and/or extraction in the classification of patterns, this study predicted cysteine connectivity patterns based on four features. Experimental results indicate that the proposed method achieves an accuracy of 79.8% when tested using fourfold cross-validation on the SP39 dataset, which is better than the prediction performance reported in previous studies. Our results also indicate that the proposed method achieves prediction accuracies as high as 70.6% and

Acknowledgements

The authors thank the National Science Council of the Republic of China, Taiwan (Contract No. NSC98-2221-E005-049-MY3) and the Central Taiwan University of Science and Technology (Contract No. CTU99-P-33) for partially supporting this research work. The anonymous reviewers are commended for their valuable comments.

References (36)

  • A. Bairoch et al., The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000, Nucleic Acids Research (2000)
  • P. Baldi et al., Large-scale prediction of disulphide bond connectivity
  • M.P.S. Brown et al., Knowledge-based analysis of microarray gene expression data using support vector machines, PNAS (2000)
  • C.C. Chang, C.J. Lin, LIBSVM: A Library for Support Vector Machines, 2001....
  • B.J. Chen et al., Disulfide connectivity prediction with 70% accuracy using two-level models, Proteins (2006)
  • Y.C. Chen et al., Prediction of disulfide connectivity from protein sequences, Proteins (2005)
  • Y.C. Chen et al., Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences, Proteins (2004)
  • C.C. Chuang et al., Relationship between protein structures and disulfide-bonding patterns, Proteins (2003)