
1 Introduction

Prediction of protein-protein interaction (PPI) types is an important problem in the life sciences because of the fundamental role of PPIs in many biological processes [6]. PPIs have been investigated in various ways, involving both experimental (in vivo or in vitro) and computational (in silico) approaches [2, 10]. Experimental approaches tend to be costly and labor intensive and suffer from noise; computational approaches for the prediction of PPIs are therefore an attractive alternative.

PPIs can be divided into non-obligate and obligate complexes, depending on whether the binding components can or cannot form stable functional structures on their own. Based on the duration and lifetime of the interaction, complexes are further classified as transient or permanent. Interfaces have been the main subject of study in predicting protein-protein interaction types, and an accuracy of about 70 % has been independently achieved by several different groups [9, 10, 13, 14]. These approaches analyze a wide range of parameters, including solvation energies, amino acid composition, conservation, electrostatic energies, and hydrophobicity.

These include a method based on PCA combined with the Chernoff extension of Fisher linear discriminant analysis [9]. PCA is needed to reduce the dimensionality of the input feature space below the sample size, but as a consequence some potentially important information is lost.

In this paper, we propose a new classification approach based on sparse discriminant analysis [12] to predict obligate (permanent) and non-obligate (transient) protein-protein interactions. To characterize the properties of a protein interaction, we propose to use binding free energies.

2 Fisher and Sparse Regularized Linear Discriminant Analyses

Fisher Linear Discriminant Analysis (FLDA) [5, 11] is a multivariate technique concerned with the search for a linear transformation that reduces the dimension of a given p-dimensional statistical model to q (q < p) dimensions, while maximally preserving the discriminatory information among the classes within the model.

Formally, suppose that there are k classes and let \( x_{ij} ,j = 1, \ldots ,n_{i} \) be the vectors of observations from the i-th class, \( i = 1, \ldots ,k \). Set \( n = n_{1} + \ldots + n_{k} \) and let \( X_{n \times p} = (x_{11} , \ldots ,x_{{1n_{1} }} , \ldots ,x_{k1} , \ldots ,x_{{kn_{k} }} )^{T} \), where p is the dimensionality of the input space. FLDA determines a linear mapping, i.e. a \( q \times p \) matrix A, that maximizes the so-called Fisher criterion \( J_{F} \):

$$ J_{F} (A) = tr((AS_{W} A^{T} )^{ - 1} (AS_{B} A^{T} )) $$
(1)

where \( S_{B} = \sum\nolimits_{i = 1}^{k} {p_{i} (m_{i} - \bar{m})(m_{i} - \bar{m})^{T} } \) and \( S_{W} = \sum\nolimits_{i = 1}^{k} {p_{i} S_{i} } \) are the between-class and the average within-class scatter matrix, respectively; \( S_{i} = \frac{1}{{n_{i} - 1}}\sum\nolimits_{j = 1}^{{n_{i} }} {(x_{ij} - m_{i} )} (x_{ij} - m_{i} )^{T} \) is the within-class covariance matrix of class i, \( m_{i} \) is the mean vector of class i, \( p_{i} \) is its a priori probability, and \( \bar{m} = \sum\nolimits_{i = 1}^{k} {p_{i} m_{i} } \) is the overall mean. FLDA maximizes the ratio of between-class scatter to average within-class scatter in the lower-dimensional space. Optimizing (1) comes down to determining an eigenvalue decomposition of \( S_{W}^{ - 1} S_{B} \) and taking the rows of A to be the q eigenvectors corresponding to the q largest eigenvalues. There are no more than \( \hbox{min} (p,k - 1) \) eigenvectors corresponding to nonzero eigenvalues.
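As an illustration, the following minimal numpy sketch computes these directions directly from a data matrix X with integer class labels y. This is our own code, not taken from [5, 11]; the class priors \( p_{i} \) are assumed to be estimated by the empirical class frequencies.

```python
import numpy as np

def flda_directions(X, y, q):
    """Classical FLDA: rows of A are the q leading eigenvectors of S_W^{-1} S_B.

    X : (n, p) data matrix, y : (n,) integer class labels, q <= min(p, k-1).
    """
    classes, counts = np.unique(y, return_counts=True)
    n, p = X.shape
    priors = counts / n                                    # empirical priors p_i
    means = np.array([X[y == c].mean(axis=0) for c in classes])   # class means m_i
    m_bar = priors @ means                                 # overall mean

    S_B = np.zeros((p, p))
    S_W = np.zeros((p, p))
    for m_i, p_i, c in zip(means, priors, classes):
        d = (m_i - m_bar)[:, None]
        S_B += p_i * (d @ d.T)                             # between-class scatter
        S_W += p_i * np.cov(X[y == c], rowvar=False)       # average within-class scatter

    # eigen-decomposition of S_W^{-1} S_B; keep the q leading eigenvectors
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(evals.real)[::-1][:q]
    return evecs.real[:, order].T                          # A is q x p
```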

When the number of variables exceeds the sample size, i.e. in the high-dimensional, low-sample-size (HDLSS) setting, the average within-class scatter matrix \( S_{W} \) is singular and classical FLDA breaks down. Several extensions have been proposed to overcome this problem, but all of them suffer from the data piling problem [8]. To ameliorate this, sparse versions of LDA have been proposed.

In our approach, to circumvent this problem, we adapt the sparse linear discriminant analysis (SLDA) approach from [12], which incorporates feature selection into FLDA. The term “sparse” means that the discriminant vectors have only a small number of nonzero components. The underlying assumption is that, among the large number of variables, many are irrelevant or redundant for the purpose of classification. The method is based on the connection between FLDA and a generalized eigenvalue problem, stated formally by the following theorem:

Theorem [12]:

Suppose \( S_{w} \) is a positive definite matrix and denote its Cholesky decomposition as \( S_{w} = R_{w}^{T} R_{w} \), where \( R_{w} \) is an upper triangular matrix. Let \( H_{b} \) be a \( k \times p \) matrix, let \( V_{1} , \ldots ,V_{q} \,(q < \hbox{min} (p,k - 1)) \) denote the eigenvectors of \( S_{W}^{ - 1} S_{B} \) corresponding to the q largest eigenvalues \( \lambda_{1} \ge \ldots \ge \lambda_{q} \), and let \( A = [\alpha_{1} , \ldots ,\alpha_{q} ] \), \( B = [\beta_{1} , \ldots ,\beta_{q} ] \). For \( \lambda > 0 \), let \( \hat{A},\hat{B} \) be the solution to the following problem:

$$ \mathop {\hbox{min} }\limits_{A,B} \sum\limits_{i = 1}^{k} {\left\| {R_{w}^{ - T} H_{b,i} - AB^{T} H_{b,i} } \right\|^{2} + \lambda \sum\limits_{j = 1}^{q} {\beta_{j}^{T} (S_{w} )\beta_{j} } } \,{\text{subject to}}\,A^{T} A = I_{p \times q} , $$

where:

\( H_{b,i} = \sqrt {n_{i} } (\bar{x}_{i} - \bar{x})^{T} \) is the i-th row of the matrix \( H_{b} = \left( {\sqrt {n_{1} } (\bar{x}_{1} - \bar{x}), \ldots ,\sqrt {n_{k} } (\bar{x}_{k} - \bar{x})} \right)^{T} \), and \( e^{{n_{i} }} \) is a vector of ones of length \( n_{i} \) (used in the algorithm below).

Then \( \hat{\beta }_{j} ,j = 1, \ldots ,q \), span the same linear space as \( V_{j} ,j = 1, \ldots ,q \).

To circumvent the singularity problem and to obtain sparse linear discriminants, the following method of regularization is applied in [12]: the first q sparse discriminant directions \( \beta_{1} , \ldots ,\beta_{q} \) are defined as the solutions to the following optimization problem:

$$ \mathop {\hbox{min} }\limits_{A,B} \sum\limits_{i = 1}^{k} {\left\| {R_{w}^{ - T} H_{b,i} - AB^{T} H_{b,i} } \right\|^{2} + \lambda \sum\limits_{j = 1}^{q} {\beta_{j}^{T} \left( {S_{w} + \gamma \frac{{tr(S_{w} )}}{p}I} \right)\beta_{j} } } + \sum\limits_{j = 1}^{q} {\lambda_{1,j} \left\| {\beta_{j} } \right\|_{1} } $$

subject to \( A^{T} A = I_{p \times q} \), where \( B = [\beta_{1} , \ldots ,\beta_{q} ] \) and \( \left\| {\beta_{j} } \right\|_{1} \) is the 1-norm of the vector \( \beta_{j} \). The same \( \lambda \) is used for all q directions, while different \( \lambda_{1,j} \)’s are allowed in order to penalize different discriminant directions.

Our empirical study suggests that the solution is very stable when \( \lambda \) varies over a wide range, for example in (0.01, 10000). K-fold cross-validation (CV) [11] can be used to select the optimal parameters \( \lambda_{1,j} \); however, when the dimension of the input data is very large the numerical algorithm becomes time consuming, and we can simply set \( \lambda_{1,1} = \ldots = \lambda_{1,q} \). The tuning parameter \( \gamma \) controls the strength of the regularization of the matrix \( S_{w} \): large values bias \( S_{w} \) too strongly towards the identity matrix (a high degree of regularization). In our empirical studies we found that the results are not sensitive to the choice of \( \gamma \) as long as a small value below 0.1 is used; we set \( \gamma = 0.05 \). A more careful study of the choice of \( \gamma \) is left for future research.

The above problem can be solved numerically by alternating optimization over A and B [12]; the resulting algorithm is summarized below.

***

Regularized sparse LDA (rSLDA) algorithm (based on [12]):

  1. Form the matrices from the input data:

    $$ H_{w} = X - \left( {\begin{array}{*{20}c} {e^{{n_{1} }} (\bar{x}_{1} )^{T} } \\ \vdots \\ {e^{{n_{k} }} (\bar{x}_{k} )^{T} } \\ \end{array} } \right) $$
    $$ H_{b} = \left( {\sqrt {n_{1} } (\bar{x}_{1} - \bar{x}), \ldots ,\sqrt {n_{k} } (\bar{x}_{k} - \bar{x})} \right)^{T} $$
  2. Compute the upper triangular matrix \( R_{w} \) from the Cholesky decomposition of:

    $$ \left( {S_{w} + \gamma \frac{{tr(S_{w} )}}{p}I} \right)\;\;{\text{such that}}\;\;\left( {S_{w} + \gamma \frac{{tr(S_{w} )}}{p}I} \right) = R_{w}^{T} R_{w} $$
  3. Solve the q independent optimization problems

    $$ \mathop {\hbox{min} }\limits_{{\beta_{j} }} \beta_{j}^{T} (\tilde{W}^{T} \tilde{W})\beta_{j} - 2\tilde{y}^{T} \tilde{W}\beta_{j} + \lambda_{1} \left\| {\beta_{j} } \right\|_{1} ,\;j = 1, \ldots ,q $$

    where

    $$ \tilde{W}_{(k + p) \times p} = \left( {\begin{array}{*{20}c} {H_{b} } \\ {\sqrt \lambda \cdot R_{w} } \\ \end{array} } \right),\;\;\tilde{y}_{(k + p) \times 1} = \left( {\begin{array}{*{20}c} {H_{b} R_{w}^{ - 1} \alpha_{j} } \\ 0 \\ \end{array} } \right) $$
  4. Compute the SVD:

    $$ R_{w}^{ - T} (H_{b}^{T} H_{b} )B = UDV^{T} \, {\text{and let}}\,A = UV^{T} $$
  5. Repeat steps 3 and 4 until convergence.

***
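For illustration, the following numpy/scikit-learn sketch implements this alternating scheme; it is our own sketch, not the authors' implementation. The random orthonormal initialization of A, the fixed iteration budget, and the use of sklearn's Lasso for step 3 are our own choices; since sklearn's Lasso rescales the squared loss by 1/(2·n_samples), that factor is absorbed into its alpha parameter.

```python
import numpy as np
from scipy.linalg import cholesky, svd
from sklearn.linear_model import Lasso

def rslda(X, y, q=1, lam=1.0, lam1=0.1, gamma=0.05, n_iter=20):
    """Sketch of the rSLDA alternating optimization summarized above."""
    classes, counts = np.unique(y, return_counts=True)
    n, p = X.shape
    xbar = X.mean(axis=0)
    means = np.array([X[y == c].mean(axis=0) for c in classes])

    # Step 1: H_b is k x p; S_w is the average within-class scatter
    H_b = np.sqrt(counts)[:, None] * (means - xbar)
    S_w = sum((counts[i] / n) * np.cov(X[y == classes[i]], rowvar=False)
              for i in range(len(classes)))

    # Step 2: upper-triangular Cholesky factor of the regularized S_w
    S_reg = S_w + gamma * np.trace(S_w) / p * np.eye(p)
    R_w = cholesky(S_reg, lower=False)              # S_reg = R_w.T @ R_w

    # initialization of A (p x q, orthonormal columns) is our own choice
    rng = np.random.default_rng(0)
    A, _ = np.linalg.qr(rng.standard_normal((p, q)))
    B = np.zeros((p, q))

    W_tilde = np.vstack([H_b, np.sqrt(lam) * R_w])  # (k + p) x p
    M = W_tilde.shape[0]
    for _ in range(n_iter):                         # Step 5: fixed iteration budget
        # Step 3: one l1-penalized least-squares problem per direction
        for j in range(q):
            y_tilde = np.concatenate(
                [H_b @ np.linalg.solve(R_w, A[:, j]), np.zeros(p)])
            lasso = Lasso(alpha=lam1 / (2 * M), fit_intercept=False, max_iter=5000)
            B[:, j] = lasso.fit(W_tilde, y_tilde).coef_
        # Step 4: orthogonal update of A from the SVD
        U, _, Vt = svd(np.linalg.solve(R_w.T, H_b.T @ H_b) @ B, full_matrices=False)
        A = U @ Vt
    return B                                        # columns are the sparse directions
```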

3 Protein-Protein Interaction Classification Method

To characterize the properties of a protein interaction, we propose to use binding free energies. These were computed using FastContact [4], which provides fast estimates of them. FastContact delivers the electrostatic energy, the solvation free energy, and the top 20 maximum and minimum values for:

  1. residues contributing to the binding free energy,

  2. ligand residues contributing to the solvation free energy,

  3. ligand residues contributing to the electrostatic energy,

  4. receptor residues contributing to the solvation free energy,

  5. receptor residues contributing to the electrostatic energy,

  6. receptor-ligand residue solvation constants,

  7. receptor-ligand residue electrostatic constants.

Thus, all these values, together with the total solvation and electrostatic energy values, compose a total of 282 features characterizing an interaction.
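The feature count follows directly from this structure (seven groups, each with 20 maximum and 20 minimum values, plus the two total energies):

$$ 7 \times (20 + 20) + 2 = 282 $$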

To create a dataset for classification, we used the pre-classified dataset from a previous study [9], containing 62 transient and 75 obligate complexes as the two classes. Each complex is listed in the form of its chains for the ligand and the receptor, respectively. The relevant structural data for each complex were obtained from the Protein Data Bank (PDB) [1], and the 282 features were then obtained by invoking FastContact.

Since the number of features (282) is greater than the number of samples in the dataset (137), we are in the HDLSS setting, so we apply sparse regularized linear discriminant analysis, i.e. the rSLDA algorithm described above, to compute the discriminant directions.

For the classification of samples in the new discriminant space, we applied the nearest centroid method [11].
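For the two-class, one-dimensional discriminant space used here, a minimal sketch of the nearest centroid rule looks as follows; this is our own illustration (function and variable names are not from [11]).

```python
import numpy as np

def nearest_centroid_predict(z_train, y_train, z_test):
    """Assign each projected test sample to the class whose centroid is closest.

    z_train, z_test are 1-D projections (e.g. X @ beta1); y_train holds class labels.
    """
    classes = np.unique(y_train)
    centroids = np.array([z_train[y_train == c].mean() for c in classes])
    dists = np.abs(z_test[:, None] - centroids[None, :])   # |z - centroid| per class
    return classes[np.argmin(dists, axis=1)]
```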

4 Rapid Estimation of Contact and Binding Free Energies

The estimation of contact and binding free energies may in general be a time-consuming job. One of the components of the binding energy is the electrostatic energy. This term applies to a system of charges and is defined as the work necessary to move all the electric charges from infinity to the positions they occupy in the analyzed system. This work does not depend on the path traveled by the charges and is a property of the static arrangement of charges in space. Electrostatic interactions act over relatively long distances [7]. For proteins, the term refers to interactions between electrically charged atoms in different proteins, or between charges on the surface of a protein and charges in the environment. The exact computation of this energy for each possible conformation would be time consuming.

We have used a method called FastContact, developed by Camacho et al. [3, 4]. The binding energy is computed as the sum of the electrostatic potential and the desolvation free energy of the proteins: \( G_{\text{binding}} = E_{\text{electrostatic}} + G_{\text{desolvation}} \). In this formula, \( E_{\text{electrostatic}} \) is the standard intermolecular Coulomb electrostatic potential with a relative permittivity that varies with the distance r and is equal to 4r. The term \( G_{\text{desolvation}} \) captures the basic features of the desolvation free energy in proteins: hydrophobic interactions, the self-energy change resulting from desolvating charged and polar atom groups, and side-chain entropy loss. The \( G_{\text{desolvation}} \) term is calculated as a sum of atomic contact potentials (ACPs) over all pairs of atoms in which the first atom belongs to the receptor and the second to the ligand. Each term of this sum is additionally multiplied by a function g(r) depending on the distance r between the involved atoms: for r > 7 Å its value is 0, for r < 5 Å it is 1, and between 5 Å and 7 Å g(r) is a smooth function. The ACPs are defined for 18 atom types. The function g(r) makes faster computation possible by zeroing the interactions between distant atoms.
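The numpy sketch below only illustrates the structure of this model; it is not FastContact's code. It combines a Coulomb term with the distance-dependent permittivity 4r and a pairwise desolvation term gated by g(r). The exact smooth form of g(r) between 5 and 7 Å is not specified in the text, so a cosine ramp is assumed; the `acp` table and the kcal·Å/(mol·e²) conversion constant are illustrative inputs.

```python
import numpy as np

def switching_g(r):
    """g(r): 1 below 5 A, 0 above 7 A, smooth in between (cosine ramp assumed here)."""
    return np.where(r < 5.0, 1.0,
                    np.where(r > 7.0, 0.0,
                             0.5 * (1.0 + np.cos(np.pi * (r - 5.0) / 2.0))))

def binding_energy(xyz_rec, q_rec, types_rec, xyz_lig, q_lig, types_lig, acp):
    """G_binding = E_electrostatic + G_desolvation over receptor-ligand atom pairs.

    xyz_* : (N, 3) coordinates in Angstrom; q_* : atomic charges;
    types_* : ACP atom-type indices (0..17); acp : (18, 18) contact-potential table.
    """
    # all receptor-ligand interatomic distances
    r = np.linalg.norm(xyz_rec[:, None, :] - xyz_lig[None, :, :], axis=-1)
    # Coulomb term with distance-dependent relative permittivity eps(r) = 4r
    coulomb_const = 332.0  # commonly used kcal*A/(mol*e^2) conversion factor
    e_elec = coulomb_const * np.sum(np.outer(q_rec, q_lig) / (4.0 * r * r))
    # desolvation: ACP for each receptor-ligand atom pair, gated by g(r)
    g_desolv = np.sum(acp[np.ix_(types_rec, types_lig)] * switching_g(r))
    return e_elec + g_desolv
```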

5 Experimental Results

In our experiments we have used the dataset of 137 protein complexes described in [13]; 75 samples in this dataset belong to the first class (“obligate interactions”) and 62 samples to the second class (“non-obligate interactions”). The dataset was randomly divided into a training set and a testing set in a ratio of 4:1.
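Such a split can be obtained, for example, with scikit-learn; the stratification and the fixed seed below are our own choices, as the text only specifies a random 4:1 split.

```python
from sklearn.model_selection import train_test_split

# 4:1 random split of the 137 complexes (X: feature matrix, y: class labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
```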

As we have only two classes (k = 2), there is only one discriminant direction \( \beta_{1} \) (q = 1). Using all variables in constructing the discriminant vector \( \beta_{1} \) might cause overfitting of the training data, resulting in a high testing error rate. Moreover, it is computationally demanding, so sparsification is a good choice.

Let m denote the number of significant variables involved in specifying the discriminant direction \( \beta_{1} \) (i.e. giving the best prediction). To find these most significant variables, we performed experiments with varying values of m. For a given value of m, only the m largest-magnitude coordinates of the vector \( \beta_{1} \) (the so-called beta values) are kept and the rest are zeroed.

After computing the components of the \( \beta_{1} \) vector with the rSLDA algorithm, taking their absolute values and sorting them in ascending order, we found that they rise slowly to about 0.3 of the maximum value at element 230 and then grow much more rapidly over the last 52 components.

We keep only the m largest values (by magnitude), zeroing all others, and track the indices of these values in the original \( \beta_{1} \). The resulting sparse vector is used to project the original 282-dimensional feature vectors onto a one-dimensional space, so the projection of the samples from the protein dataset uses only these m non-zero coefficients.
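A short sketch of this sparsification and projection step (our own implementation of the description above, with hypothetical names):

```python
import numpy as np

def project_with_m_largest(X, beta1, m):
    """Zero all but the m largest-magnitude coefficients of beta1 and project X onto it."""
    keep = np.argsort(np.abs(beta1))[-m:]   # indices of the m largest |beta| values
    beta_sparse = np.zeros_like(beta1)
    beta_sparse[keep] = beta1[keep]
    return X @ beta_sparse                  # one-dimensional projection of each sample
```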

Classification is then performed in this new discriminant space by the nearest centroid method.

The results are shown in Fig. 1. We can observe that the error rate of the nearest centroid classifier first grows rapidly and then decreases as m increases up to 28, where the error is about 25 % ± 5 measured on the testing set. For larger values of m, an almost constant error rate was observed.

Fig. 1. The average classification error rate as a function of the number of variables, using the nearest centroid method [10] on the projected data (based on 5 random partitions of the dataset into training and test sets); the local minimum is at m = 28.

From the plot it is clear that if we set m = 28 as the number of component variables in the discriminant vector \( \beta_{1} \), the sparse LDA algorithm can discriminate the two classes fairly well (classification performance of about 75 % ± 5).

These 28 input features (“selected” by the rSLDA algorithm) are the most significant for classification. They are the following (listed in ascending order of the absolute value of the coefficients composing the vector \( \beta_{1} \)):

$$ \begin{aligned} & 202\;\;198\;\;281\;\;200\;\;48\;\;42\;\;243\;\;203\;\;47\;\;133\;\;128\;\;121\;\;161\;\;160 \\ & 157\;\;132\;\;49\;\;156\;\;46\;\;134\;\;241\;\;131\;\;155\;\;158\;\;127\;\;119\;\;135\;\;41 \\ \end{aligned} $$

Among these 28 features, 13 come from the receptor residues contributing to the desolvation free energy, but these are not at the beginning of the above list. It can also be observed that within each of the 7 groups of energetic features, only features with extreme (minimum or maximum) contributions to the energy are selected. The features at the beginning of the list are those from the receptor residues contributing to the electrostatic energy. One may conclude that the electrostatic energy is the most important in the prediction of obligate/non-obligate protein-protein interactions. Electrostatic energy involves a long-range interaction and occurs between charged atoms of the two interacting proteins.

Thus, the rSLDA algorithm also suggests which constituents are the most important in the classification of interactions.

6 Conclusions

We have proposed a classification approach for obligate/non-obligate (transient) protein-protein complexes. We have used a regularized version of the sparse linear discriminant analysis algorithm [12] for feature extraction as well as for input variable selection. To discriminate between the two types of protein interactions, obligate and non-obligate, we have used “energetic features” based on the binding free energy, defined as the sum of the desolvation and electrostatic energies. These were computed efficiently using the FastContact package [4]. The results on the protein-protein interaction dataset showed that using only 28 of the 282 input variables enables the classification of these two types of interactions with a performance of 75 %. Among the most important features are those from residues contributing to the electrostatic energy.

The hypothesis on the importance of the electrostatic energy in the prediction of obligate/non-obligate protein-protein interactions should be confirmed by additional experiments on larger protein datasets. This will be the subject of our future research.