Knowledge acquisition and development of accurate rules for predicting protein stability changes

https://doi.org/10.1016/j.compbiolchem.2006.06.004

Abstract

Understanding the mechanisms by which protein stability changes is one of the most important and valuable tasks in molecular biology. Conventional methods of predicting protein stability changes mainly focus on improving prediction accuracy. However, it is also desirable to extract domain knowledge from large databases that benefits accurate prediction of protein stability changes. This paper presents an interpretable prediction tree method (named iPTREE) that produces explanatory rules to explore hidden knowledge while achieving high prediction accuracy, and consequently analyzes the factors influencing protein stability changes. To evaluate iPTREE and the knowledge it yields on protein stability changes, a thermodynamic dataset from ProTherm consisting of 1615 mutants produced by single point mutations is adopted. As a predictor of protein stability changes, the rule-based approach achieves a prediction accuracy of 87%, which is better than other methods based on artificial neural networks (ANN) and support vector machines (SVM). Moreover, those methods lack the ability to discover biological knowledge. The human-interpretable rules produced by iPTREE reveal that temperature is a factor of concern in predicting protein stability changes. For example, one of the interpretable rules with high support is as follows: if the introduced residue type is Alanine and the temperature is between 4 °C and 40 °C, then the stability change will be negative (destabilizing). The present study demonstrates that iPTREE can easily be applied to protein stability changes where more understandable knowledge is required.

Introduction

Understanding the relationship between the structure, function, and properties of proteins is helpful for protein design that produces novel protein sequences. For this purpose, interpreting stability is both a precursor to and a goal of the ability to design stable proteins successfully (Daggett and Fersht, 2003). To date, various methods have been proposed to predict stability changes (ΔΔG) upon protein mutation, including energy-based methods and machine learning approaches. Energy-based methods are based on force fields and can be categorized into three major classes depending on the energy functions (Guerois et al., 2002): (a) those using physically effective energy functions (Prevost et al., 1991); (b) those based on statistical potentials for which energies are derived from the frequencies of residue contacts (Gilis and Rooman, 1997); and (c) those using empirically effective energy functions obtained from experimental data (Funahashi et al., 2001). Recently, machine learning approaches based on artificial neural networks (ANN) (Capriotti et al., 2004) and support vector machines (SVM) (Capriotti et al., 2005) have been proposed.

All of the above-mentioned methods concentrate on raising prediction accuracy but are not accompanied by knowledge acquisition. However, predicting protein stability alone does not satisfy the goal of understanding the relationship between the structure, function, and properties of proteins. Moreover, because the datasets used to design predictors are often insufficiently large, overfitting may occur, resulting in a wrong model and incorrect inferences. Therefore, validation of the model and its inferences is necessary and crucial. If the prediction model were established together with human-interpretable knowledge, it would be more credible after confirmation. Thus, it is better to design an interpretable predictor that takes both prediction accuracy and knowledge acquisition into account. In this study, the proposed interpretable prediction tree method (named iPTREE) aims to simultaneously achieve the following three objectives.

The ANN predictor (Capriotti et al., 2004) reached an accuracy as high as 81% in predicting the stability (stabilizing/destabilizing) of protein mutants (the sign of the ΔΔG value), and performs better than the existing energy-based methods in terms of prediction accuracy. However, the ANN predictor lacks the ability to discover biological knowledge. The rule-based approach generated by iPTREE successfully predicts the sign of the ΔΔG value with an accuracy of 87% in a 10-fold cross-validation test, which is significantly better than the ANN predictor using the same features and dataset. The high accuracy of the prediction model provides more confidence in the knowledge derived from it.

A mechanism for systematically and actively capturing knowledge from experimental results is valuable for understanding an unknown concept. iPTREE can reveal the important factors and decision rules governing protein stability changes upon mutation from a large and complex database.

At the same time, the rule base also demonstrates interpretable decision rules. One of those rules with high support is as follows:

If the introduced residue type is Alanine and temperature is between 4 °C and 40 °C, then the stability change will be negative.

Those interpretable rules may agree with previous research or represent new discoveries that still require confirmation. However, because the rule conditions are interpretable (temperature, introduced/deleted residue type, and the environment information of the mutation position), the rules can more easily be validated as usable knowledge. In this study, although iPTREE was applied to predict protein stability changes, it can be extended to other applications and has been successfully used in the prediction and analysis of DNA-binding sites of proteins (Ho et al., 2005).
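To make the idea of an interpretable rule concrete, the Alanine rule quoted above can be written as a simple predicate and scored against a set of records. This is only a minimal sketch and not the iPTREE implementation: the `Mutant` record, the three-letter residue code "ALA", the inclusive temperature bounds, the toy data, and the definition of support as the fraction of records covered by the rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Mutant:
    """Hypothetical record; field names are illustrative, not from the paper."""
    introduced: str  # introduced residue type (three-letter code assumed)
    temp_c: float    # experimental temperature in degrees Celsius
    ddg: float       # stability change; negative is taken to mean destabilizing


def alanine_rule(m: Mutant) -> bool:
    """Condition part of the rule: introduced residue is Alanine, 4-40 degrees C."""
    return m.introduced == "ALA" and 4.0 <= m.temp_c <= 40.0


def rule_stats(records: List[Mutant],
               condition: Callable[[Mutant], bool]) -> Tuple[float, float]:
    """Return (support, accuracy) for a rule that predicts a negative ddg.

    support  = fraction of all records that satisfy the condition (assumed definition)
    accuracy = fraction of covered records whose ddg is actually negative
    """
    covered = [m for m in records if condition(m)]
    support = len(covered) / len(records)
    correct = sum(1 for m in covered if m.ddg < 0)
    accuracy = correct / len(covered) if covered else 0.0
    return support, accuracy


# Made-up toy records, purely for illustration (not ProTherm data).
toy = [
    Mutant("ALA", 25.0, -1.2),
    Mutant("ALA", 20.0, -0.4),
    Mutant("ALA", 60.0, 0.3),   # outside the temperature range, not covered
    Mutant("GLY", 25.0, -2.0),  # different residue, not covered
    Mutant("ALA", 37.0, 0.1),   # covered, but the rule's prediction is wrong
]
support, accuracy = rule_stats(toy, alanine_rule)
```

Because the condition is a readable predicate over temperature and residue type, a domain expert can inspect or adjust it directly, which is the property the text attributes to rule-based prediction.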

From various viewpoints, several studies have similarly revealed that positional parameters play an important role in understanding the folding and stability of protein mutants (Gilis and Rooman, 1997; Gromiha et al., 1999; Gromiha and Selvaraj, 2002; Capriotti et al., 2004). However, the relative importance of the secondary structure and solvent accessibility of mutant residues for predicting the stability of protein mutants has not yet been fully explored. In a recent discussion of their relative importance, secondary structure was reported to carry similar or more information than solvent accessibility for understanding the stability of protein mutants (Saraboji et al., 2005). iPTREE was used to perform a factor analysis on this question, and the results are discussed.

Instead of the conventional investigation using the linear correlation between one individual factor and the experimental value, iPTREE further considers the interaction between the factor of concern and other pre-existing factors, namely the surrounding effect, in a factor analysis that uses prediction accuracy as the measure of importance. That is, the relationship between a feature set and the experimental value is considered. Based on a one-factor-at-a-time strategy for analyzing the two influencing factors, secondary structure and solvent accessibility, iPTREE used three feature sets: (1) including solvent accessibility, (2) including secondary structure, and (3) including both, each with the same surrounding effect. The statistic F = 0.92 from a one-way analysis of variance (ANOVA) for the difference in means is consistent with the hypothesis that the three conditions have equal means. This may be because the environment information of the mutation position is sufficient to cover the information provided by secondary structure and solvent accessibility.
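For readers unfamiliar with the statistic, the one-way ANOVA F value compares the variance between group means with the variance within groups. The sketch below computes it from scratch under the assumption that each group holds repeated accuracy measurements for one feature set; the function name `one_way_anova_f` and the sample accuracies are hypothetical and are not the values behind F = 0.92.

```python
from typing import List, Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """Return the one-way ANOVA F statistic for k groups of observations.

    F = (SS_between / (k - 1)) / (SS_within / (N - k)),
    where N is the total number of observations.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means: List[float] = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical accuracies for three feature sets (illustrative values only).
acc_solvent = [0.86, 0.88, 0.87, 0.85]
acc_secondary = [0.87, 0.86, 0.88, 0.86]
acc_both = [0.87, 0.88, 0.86, 0.87]
f_value = one_way_anova_f([acc_solvent, acc_secondary, acc_both])
```

A small F value means the spread between the group means is no larger than the spread within groups, which is why a value such as 0.92 gives no evidence that the three feature sets differ in mean accuracy.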

Section snippets

Protein and mutant datasets

For comparison, the same dataset used by Capriotti et al. (2004) is adopted, which was obtained from the thermodynamic database for proteins and mutants (ProTherm; Gromiha et al., 2000). The dataset (S1615) consists of 1615 single point mutations obtained from 42 protein sequences. Each record of S1615 contains the following seven features:

  • (1) Md: deleted-residue mutation type;

  • (2) Mi: introduced-residue mutation type;

  • (3) pH: the pH value of the experimental condition;

  • (4) Temp: the temperature (°C) used in

Confidence level effects of C4.5 algorithm

To avoid overfitting in decision tree learning, we varied the confidence level that affects tree pruning in C4.5. Because the appropriate confidence level is problem-dependent, a preliminary analysis was carried out using the following simulations. iPTREE was applied to S1615 with feature set F3 using a 10-fold cross-validation test. The computing platform was an Intel Celeron 2.4 GHz processor with 768 MB RAM running Microsoft Windows XP.
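The 10-fold cross-validation test mentioned above can be sketched as follows. This is an illustrative outline only: the helper names `k_fold_indices` and `cross_val_accuracy` are hypothetical, the shuffling seed is arbitrary, and a trivial majority-class learner stands in for the C4.5 tree, which is not implemented here.

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(xs: Sequence, ys: Sequence,
                       fit: Callable[[list, list], Callable],
                       k: int = 10, seed: int = 0) -> float:
    """Train on k-1 folds, test on the held-out fold, and pool the results.

    `fit(train_x, train_y)` must return a function mapping one x to a label.
    """
    correct = 0
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(ys)) if i not in held_out]
        predict = fit(train_x, train_y)
        correct += sum(1 for i in test_idx if predict(xs[i]) == ys[i])
    return correct / len(xs)


def fit_majority(train_x: list, train_y: list) -> Callable:
    """Stand-in learner (an assumption): always predict the majority label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


# Toy labels: -1 for destabilizing, +1 for stabilizing (made-up proportions).
toy_x = list(range(100))
toy_y = [-1] * 70 + [1] * 30
baseline_accuracy = cross_val_accuracy(toy_x, toy_y, fit_majority)
```

To study a pruning confidence level in this scheme, one would call `cross_val_accuracy` once per candidate value with a `fit` function that trains a tree pruned at that level, then compare the resulting accuracies.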

Table 1 shows the prediction results using various

Conclusions

In this paper, the proposed iPTREE was effectively applied to establish an accurate rule base from the thermodynamic database of proteins and mutants. Within the iPTREE framework, potential knowledge for protein stability prediction can be extracted and transformed into interpretable rules, which can support further validation by biochemistry experts. Meanwhile, the importance of factors affecting protein stability changes can be compared using prediction accuracy as a comprehensive index.

In

References (23)

  • E. Capriotti et al. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res. (2005)