1 Introduction

The automatic prediction of the disulfide-bonding state of cysteines from the amino-acid sequences of proteins is a relevant task in bioinformatics [8, 17]. It represents a significant step towards realizing functional and structural predictions from the primary structure of (generally, non-homologous) proteins [9]. Several statistical and machine learning approaches (e.g., artificial neural networks (ANN) [8], hidden Markov models (HMM) [1], support vector machines (SVM) [19], conditional random fields (CRF) [16]) have been applied to the task over the last couple of decades. A review is offered in [17]. In most cases, the accuracy of the different approaches could be improved by combining the bare sequence-level information (i.e., a representation of the sequence of amino-acids, or residues) with global descriptors of the chemical and physical properties of the overall protein, or with the evolutionary information stemming from the multiple alignment of homologous individuals [8]. In their raw representation, the sequences at hand may be thought of as character strings built from a finite, discrete alphabet of symbols (which, in the typical case of eukaryotic organisms, comprises 20 symbols) corresponding to the standard abbreviations of the individual amino-acids. Some feature extraction process is required in order to map these symbols onto real-valued feature vectors whenever continuous-valued models (like ANNs or SVMs) are used. The length of the sequences varies significantly on a protein-by-protein basis, ranging from a few dozen residues to a few thousand. These sequences may contain a variable number of cysteines (from none to several dozen) which, in turn, may form disulfide bonds with other cysteines within the protein. Disulfide bonds affect the folding and the conformational stability of the protein, with obvious implications for its secondary and tertiary structures (the quaternary structure is affected, in turn, insofar as one or more bonds are formed with cysteines present in the environment, e.g. within neighboring proteins). It is an implicit consequence of the central dogma of molecular biology [7] that the sequences (i.e., the primary structure of the proteins) encapsulate the information that an automatic learner could exploit in order to accurately predict the disulfide-bonding state of the cysteines. Still, the task remains by and large an open problem to date.

The paper proposes and evaluates a hybrid automatic disulfide bond predictor that stems from the combination of a recurrent neural network (RNN) and a dynamic probabilistic graphical model for sequences. The latter, namely the dynamic hybrid random field (D-HRF) that we first proposed in [4, 18], is an extension to sequences of the hybrid random field (HRF) [10]. The RNN is used as a recurrent autoencoder that dynamically extracts a discrete representation of the residues that is suitable for modeling via D-HRF (depending on the disulfide bonding state, and according to a maximum pseudo-likelihood criterion). The definition of D-HRF and the fundamentals of its algorithms for parameter and structure learning are reviewed in Sect. 2, while the hybrid RNN/D-HRF system is presented in Sect. 3. All in all, the machine is expected to exploit both local (residue-neighborhood level) and global (sequence level) information, by means of (1) a proper location-dependent encoding of the primary structure into properly defined random variables via RNN, and of (2) a short-term/long-term modeling of the probabilistic (in)dependencies among these variables realized by the D-HRF. The comparative results of experiments carried out on the Protein Data Bank (PDB) dataset [3] are reported in Sect. 4. Conclusions are drawn in Sect. 5.

2 Review of Dynamic Hybrid Random Fields

The most popular instances of probabilistic graphical models are represented by Bayesian networks (BNs) [13], and by Markov random fields (MRFs) [12]. There are (in)dependence structures over any given set of random variables which can be modeled via BNs but not via MRFs, and vice-versa. Besides, established learning algorithms for BNs and MRFs present high computational complexity as the number of variables increases. Incidentally, no learning algorithm for MRFs has qualified as a standard reference so far. In [11], a probabilistic graphical model known as the hybrid random field (HRF) was introduced to overcome such drawbacks of traditional paradigms. The HRF was proven to subsume the modeling capabilities of both BNs and MRFs. A broad-sense definition of HRFs as a specific set of BNs is given in [10]. The reader is referred to the latter for all basic concepts of standard HRFs which are relevant to this paper. It is proven that HRFs possess the modularity property [10], that is, they can factorize the overall joint probability of the modeled random variables as a product of local probabilistic quantities defined at the level of the individual BNs in the HRF. As a consequence, a strict-sense and simpler definition of HRF can be given, which roughly goes as follows: an HRF is a collection of Bayesian networks which possesses the modularity property (see, e.g., [11]). In this paper, for simplicity and coherence with the proposed algorithms, the strict-sense definition of HRF is assumed (without loss of generality).

A dynamic extension of HRFs to sequence modeling was proposed in [4, 18]. The model is referred to as the dynamic hybrid random field (D-HRF), and it is formally defined as follows.

Definition: A dynamic HRF \(\mathcal{DH}\) is a tuple \(\mathcal{DH} = (\mathbf{X}, S, \mathbf{\pi }, \mathcal{F}, \mathbf{a}, \mathcal{H})\) where

1. \(\mathbf {X}\) is a set of (observable) random variables \(X_{1}, \ldots , X_{n}\). Outcomes of the random variables depend on time \(t = 1, \dots , T\), that is, we will write \(X_{i}(t)\) whenever we need to make the dependency explicit.

2. S is a set of Q latent random variables, \(S=\{S_1, \dots , S_Q\}\). It is assumed that sequences of such latent variables are responsible for the generation of sequences of outcomes of the observable variables, and that the variables in S can be thought of as the states of a discrete-time Markov chain (latent Markov assumption). We write \(q_t\) to denote the state of the Markov chain at time t, for \(t=0, \dots , T\).

3. \(\mathbf{\pi }\) is the probability distribution of the initial latent variables, i.e. \(\mathbf{\pi } = \{Pr(q_0 = S_i), S_i \in S\}\), where t is the discrete time index. For instance, if the Markov chain over S may equally likely start with any latent variable, then \(\mathbf{\pi }\) is uniform over S. Conversely, if a certain \(S_j\) can never occur at time \(t=0\), then \(\mathbf{\pi }(S_j) = 0\), etc.

4. \(\mathcal{F} \subseteq S\) is the set of final states, i.e. the latent variables which can legitimately generate sets of outcomes of the observable variables at time T (namely, at the end of sequences).

5. \(\mathbf{a}\) is a probability distribution that characterizes the (allowed) transitions between latent variables, that is \(\mathbf{a}_{ij} = Pr(q_t = S_j \mid q_{t-1} = S_i)\) for \(S_i, S_j \in S\), where the transition probabilities \(\mathbf{a}_{ij}\) are assumed to be independent of the time t. Note that the definition is meaningful due to the latent Markov assumption.

6. \(\mathcal{H}\) is a set of HRFs over \(\mathbf{X}\), \(\mathcal{H} = \{ \mathcal{H}_1, \dots , \mathcal{H}_Q \}\), where \(\mathcal{H}_q\) is uniquely associated with the q-th latent variable \(S_q\), such that the joint emission probability \(\mathbf{b}(\mathbf{X}) = P(X_{1}, \ldots , X_{n} \mid S_q)\) is modeled via the HRF \(\mathcal{H}_q\) over \(\mathbf{X}\), independently of time t. We assume that the probability distribution of \(\mathbf{X}(t)\) is independent of that of \(\mathbf{X}(t^{\prime })\) (for all \(t^{\prime } \ne t\)) given the latent variable (emission Markov assumption). In this definition, bearing in mind the definition of HRF, it turns out that \(\mathcal{H}_q\) is a set of Bayesian networks \(BN_{q,1}, \ldots , BN_{q,n}\) (with directed acyclic graphs \(\mathcal {G}_{q,1}, \ldots , \mathcal {G}_{q,n}\)) such that:

    (a) each \(BN_{q,i}\) contains \(X_{i}\) plus a subset \(\mathcal {R}_q(X_{i})\) of \(\mathbf {X} \setminus \{X_{i}\}\), namely the set of relatives of \(X_{i}\) in \(BN_{q,i}\);

    (b) for each \(X_{i}\), \(P(X_{i} \mid (\mathbf {X} \setminus \{X_{i}\}), q) = P(X_{i} \mid \mathcal {MB}_{q,i}(X_{i}))\), where \(\mathcal {MB}_{q,i}(X_{i})\) is the set containing the parents, the children, and the parents of the children of \(X_{i}\) in \(\mathcal {G}_{q,i}\) (namely, the Markov blanket of \(X_{i}\) in \(BN_{q,i}\)).

    The Markov assumption holding in HRFs is referred to as the observable Markov assumption in the present framework.

Note that the overall D-HRF can be thought of as a probabilistic graphical model over the set of random variables \(S \cup \mathbf{X}\). Nonetheless, this definition allows for separate sets of BNs (i.e., different HRFs) for each latent variable, meaning that it does not extend regular HRFs to sequences by defining them as sets of dynamic Bayesian networks in a straightforward manner.
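To fix ideas, the tuple \(\mathcal{DH} = (\mathbf{X}, S, \mathbf{\pi }, \mathcal{F}, \mathbf{a}, \mathcal{H})\) can be mirrored by a minimal data structure such as the following Python sketch. All class and field names (DynamicHRF, BayesianNet, etc.) are our own, purely illustrative choices and are not taken from [4, 18].

```python
# Minimal, illustrative data structure mirroring DH = (X, S, pi, F, a, H).
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class BayesianNet:
    """BN_{q,i}: X_i plus its relatives R_q(X_i), with a DAG G_{q,i} and CPTs."""
    target: int                                # index i of X_i
    relatives: Set[int]                        # R_q(X_i), a subset of X \ {X_i}
    edges: Set[Tuple[int, int]] = field(default_factory=set)  # directed edges of G_{q,i}
    cpt: Dict = field(default_factory=dict)    # conditional probability tables

@dataclass
class HRFState:
    """H_q: one HRF (a collection of local BNs over X) per latent state S_q."""
    local_bns: List[BayesianNet]               # BN_{q,1}, ..., BN_{q,n}

@dataclass
class DynamicHRF:
    n_observables: int                         # |X| = n
    n_states: int                              # |S| = Q
    pi: List[float]                            # initial-state distribution
    final_states: Set[int]                     # F, a subset of S
    a: List[List[float]]                       # Q x Q transition matrix
    hrfs: List[HRFState]                       # H = {H_1, ..., H_Q}
```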

Learning in D-HRFs occurs at two levels, i.e. parameter learning and structure learning. Let \(\mathcal{DH}\) be a D-HRF, and let \(O = O_1, O_2, \ldots , O_T\) be a training sequence of outcomes of the observable variables, i.e. \(O_t = (x_1, \dots , x_n)\) for \(t=1, \dots , T\). The parameter learning algorithm exploits a recursive scheme [18], which is similar (to some extent) to the popular forward-backward procedure for HMMs [14]. Assuming that a (preliminary, at least) structure of the D-HRF is given, the algorithm aims at maximizing the pseudo-likelihood \(P^{*}(O|\mathcal{DH})\), as in regular probabilistic graphical models (see [10] for a justification of why the pseudo-likelihood is used instead of the bare likelihood criterion). Each step of the forward-backward procedure involves learning the conditional probability tables (CPTs) [10] of the BNs modeling the local conditional distributions for each HRF \(\mathcal{H}_1, \dots , \mathcal{H}_Q\) associated with the latent variables. That is, for each latent variable \(q = 1, \dots , Q\) and for each observable variable \(X_{i}\), the task is to learn the parameters of the conditional distribution \(P(X_{i}| q, mb_{q,i}(X_{i}))\), for each state \(mb_{q,i}\) of the variables in \(\mathcal {MB}_{q,i}\), where \(\mathcal {MB}_{q,i}\) is the Markov blanket [10] (in \(\mathcal{H}_q\)) for the i-th observable variable and the q-th latent variable. In short, the overall procedure relies on combining the CPT learning algorithm of the individual HRFs [10] with the iterative forward-backward dynamic processing of the input sequence. The details of the algorithm are given in [18].
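For concreteness, the following is a minimal sketch of the pseudo-log-likelihood of a single observation \(O_t\) under a fixed latent state q, exploiting the modularity property (a sum of local terms \(\log P(x_i \mid mb_{q,i}(x_i))\)). It is only illustrative: the actual D-HRF criterion combines such terms within the forward-backward recursion described in [18], and the CPT accessor below is a hypothetical placeholder.

```python
import math
from typing import Callable, Sequence, Tuple

def state_pseudo_log_likelihood(
    o_t: Sequence[str],
    markov_blankets: Sequence[Sequence[int]],
    cpt_lookup: Callable[[int, str, Tuple[str, ...]], float],
) -> float:
    """Pseudo-log-likelihood log P*(O_t | S_q) = sum_i log P(x_i | mb_{q,i}(x_i)).

    o_t             -- outcomes (x_1, ..., x_n) of the observable variables at time t
    markov_blankets -- for each X_i, the indices of its Markov blanket in H_q
    cpt_lookup      -- hypothetical accessor returning P(X_i = x_i | MB_{q,i} = mb)
                       from the CPTs of BN_{q,i}
    """
    ll = 0.0
    for i, x_i in enumerate(o_t):
        mb_state = tuple(o_t[j] for j in markov_blankets[i])
        ll += math.log(max(cpt_lookup(i, x_i, mb_state), 1e-300))  # guard against log(0)
    return ll
```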

As for structure learning in a generic (say, the q-th) HRF within the D-HRF, it is formulated as the problem of learning, for each variable \(X_{i}\), what other variables appear as nodes in the Bayesian network \(BN_{q,i}\), and what edges are contained in the directed acyclic graph (DAG) \(\mathcal {G}_{q,i}\). In other words, this means learning the structure of each Markov blanket \(\mathcal {MB}_{q,i}(X_{i})\) within the q-th HRF. While parameter learning assumes that the Markov blanket of each variable has previously been fixed, the aim of structure learning is to identify each Markov blanket and to determine its graphical structure. A structure learning algorithm for D-HRFs, called dynamic Markov blanket merging (DMBM), was presented in [4]. The aim of DMBM is to find an assignment of Markov blankets \(\mathcal {MB}_{q,1}(X_{1}), \ldots , \mathcal {MB}_{q,n}(X_{n})\) to the nodes \(X_{1}, \ldots , X_{n}\) (within the q-th HRF) that maximizes the model pseudo-likelihood given the dataset. The basic idea behind DMBM is to start from a certain assignment of neighbors to the variables of the model, learn the local BNs of the model, and then iteratively refine the assignment so as to come up with Markov blankets that increase the model pseudo-likelihood with respect to the previous assignment. This iterative procedure stops when no further refinement of the Markov blanket assignment increases the value of the pseudo-likelihood. In other words, DMBM is nothing but a local search algorithm exploring a space of possible Markov blanket assignments to the observable variables for each state of the D-HRF. The formalization of the algorithm is found in [4].
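The local-search skeleton of such a procedure can be sketched as follows. This is a generic hill-climbing loop under our own naming conventions, not the formal DMBM algorithm of [4]: the `score` and `candidate_moves` callables stand in for the pseudo-likelihood evaluation (after re-learning the local BNs) and for the blanket refinement moves, respectively.

```python
from typing import Callable, List, Set

def dmbm_local_search(
    initial_blankets: List[Set[int]],
    score: Callable[[List[Set[int]]], float],
    candidate_moves: Callable[[List[Set[int]]], List[List[Set[int]]]],
) -> List[Set[int]]:
    """Greedy hill-climbing over Markov blanket assignments for one latent state q.

    score           -- model pseudo-likelihood after re-learning the local BNs
                       under a given blanket assignment (placeholder, see [4, 10])
    candidate_moves -- enumerates neighboring assignments, e.g. adding or removing
                       one relative of one variable (placeholder)
    """
    current = initial_blankets
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for proposal in candidate_moves(current):
            s = score(proposal)
            if s > current_score:                       # keep a refinement only if it
                current, current_score = proposal, s    # increases the pseudo-likelihood
                improved = True
                break                                   # restart the neighborhood scan
    return current                                      # stop when no move improves
```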

3 The Hybrid RNN/D-HRF

The idea behind the proposed approach is related to some extent to Bengio’s hybrid ANN/HMM system for automatic speech recognition [2], where an ANN is trained to extract features for an HMM from an input sequence of acoustic observations. In the present study, a recurrent neural autoencoder is trained to develop suitable internal representations of the sequences of amino-acids. As in regular autoencoders, this representation is a lower-dimensional encoding of any observed sub-sequence of residues up to the current location (assuming the primary structure is fed to the RNN sequentially, residue after residue, starting from the first amino-acid in the protein). The encoding is expected to capture most of the relevant information necessary to reconstruct the amino-acid sequence at hand, filtering out noise and redundancies. The rationale behind the approach is fourfold: (1) reducing the local sparsity of the representation of the residues. In fact, a plausible vector-space representation of the t-th amino-acid in the protein would require a 20-dimensional vector having null components except for that corresponding to the residue itself. The autoencoder maps this representation onto a corresponding, lower-dimensional, and denser one; (2) as a consequence of its recurrent connections, the RNN is expected to maintain a moving memory of the internal representations entailed by the presence of individual amino-acids met at an earlier time along the protein. Therefore, each residue can extend its influence over a mid-term period, not limited to its specific location; (3) reducing the noise and mis-alignments due to potential laboratory errors in protein sequencing and, above all, in the multiple alignment procedure used for mapping whole families of homologous proteins onto a single prototypical representative; (4) feeding the predictor (in this case, the D-HRF) with a representation which, besides the local information at time t, accounts for short- and mid-term dependencies among the input observations (long-term modeling, and the actual prediction of the bonding state, are eventually accomplished by the D-HRF).

Formally, let \(\mathcal{T} = \{ \mathcal{P}_1, \ldots , \mathcal{P}_n \}\) be a dataset of n proteins (or, prototypical outcomes of a multiple-alignment procedure over subsets of homologous proteins). For \(i=1, \ldots , n\) we have \(\mathcal{P}_i = a_{i,1}, \ldots , a_{i,n(i)}\), that is, a sequence of length n(i) of amino-acids where each \(a_{i,j}\) (\(j=1, \ldots , n(i)\)) is represented via the usual (say) 3-letter abbreviation. In order to feed an RNN with the sequences, we resort to an equivalent real-valued representation in terms of a “one-hot” coding function \(c:\{\)“Ala”, “Arg”, \(\ldots \), “Val”\( \} \rightarrow \mathbb {R}^{20}\) defined as c(“Ala”\() = (1.0, 0.0, \ldots , 0.0)\), c(“Arg”\() = (0.0, 1.0, \ldots , 0.0)\), etc. The input to the RNN at time j over sequence i is denoted accordingly as \(\mathbf{x}_{i,j} = c(a_{i,j})\). For notational convenience, from now on we write \(\mathbf{x}_1^T\) to denote a generic RNN input sequence of length T obtained this way.
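A minimal sketch of the coding function \(c(\cdot )\) follows (Python, assuming NumPy); the function names and the alphabetical ordering of the 20 standard residues are our own illustrative choices, consistent with the definition above.

```python
import numpy as np

# One-hot coding c(.) over the 20 standard amino-acids (3-letter abbreviations).
AMINO_ACIDS = ["Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
               "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def one_hot(residue: str) -> np.ndarray:
    """c: {"Ala", ..., "Val"} -> R^20, e.g. c("Ala") = (1.0, 0.0, ..., 0.0)."""
    x = np.zeros(20)
    x[AA_INDEX[residue]] = 1.0
    return x

def encode_protein(residues: list) -> np.ndarray:
    """Maps a protein P_i = a_{i,1}, ..., a_{i,n(i)} to the RNN input sequence x_1^T."""
    return np.stack([one_hot(a) for a in residues])
```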

The RNN architecture is a plain multilayer perceptron (MLP) backbone with an additional set of context units (which, at time \(t+1\), back up the inputs at time t), such that each input unit has a corresponding context unit which is laterally connected to the former, and each context unit has a self-feeding recurrent connection from and to itself. In so doing, the context units form a decaying memory of the filtered history of the previous inputs, keeping a trace of the amino-acids observed in the past. Feed-forward connections stem from the context units and feed the hidden layer of the RNN (in a fully connected manner, i.e. between all possible context-to-hidden unit pairs). The RNN has 20 input and 20 output units. When fed with the generic input vector \(\mathbf{x}_t\) at time t, the RNN returns an output \(\mathbf{y}_t\) which depends on \(\mathbf{x}_t\) and on the history of previous inputs. Instead of prescribing \(\mathbf{y}_t \simeq \mathbf{x}_t\), as in regular autoencoders, we rather enforce the mid-term memory and the time-smoothing mechanisms expected of the resulting machine by filtering the sequence of target outputs \({\hat{\mathbf{y}}}_1, \ldots , {\hat{\mathbf{y}}}_T\) for backpropagation through time (BPTT) as follows. First (at time \(t = 1\)), for \(k=1, \ldots , 20\) we let \({\hat{y}}_{1,k} = x_{1,k}\). Then, for \(t=2, \ldots , T\) we let \({\hat{y}}_{t,k} = \min (x_{t,k} + \rho {\hat{y}}_{t-1,k}, 1.0)\), where \(\rho \in (0,1)\) is a memory decay factor. In practice, the appearance of a certain residue at time t casts a (vanishing) shadow onto the next time steps, whose length is determined by \(\rho \). The initialization at time \(t=1\) simply sets the initial target equal to the initial input (as in a regular autoencoder). Any time an amino-acid is found along the input sequence, the corresponding component of the target output is set to 1 (making sure that the target values never exceed 1, regardless of the past history: that is the rationale behind the presence of the \(\min (\cdot )\) function in the equation). Starting from the next time step \(t+1\), the k-th target progressively lowers its value, with a persistence that depends on \(\rho \), vanishing eventually.
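The target-filtering rule can be written compactly as in the following sketch (Python/NumPy); the array layout and the default \(\rho = 0.9\) (the value used in Sect. 4) are assumptions made only for illustration.

```python
import numpy as np

def filtered_targets(x: np.ndarray, rho: float = 0.9) -> np.ndarray:
    """Builds the BPTT targets: y_hat_{1,k} = x_{1,k} and, for t > 1,
    y_hat_{t,k} = min(x_{t,k} + rho * y_hat_{t-1,k}, 1.0).

    x   -- (T, 20) one-hot input sequence
    rho -- memory decay factor in (0, 1)
    """
    y_hat = np.empty_like(x, dtype=float)
    y_hat[0] = x[0]                                       # initial target = initial input
    for t in range(1, x.shape[0]):
        y_hat[t] = np.minimum(x[t] + rho * y_hat[t - 1], 1.0)   # decaying "shadow", capped at 1
    return y_hat
```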

Once training is accomplished, the RNN realizes a time-dependent mapping \(m:\mathbb {R}^{20} \times \mathbb {N} \rightarrow \mathbb {R}^{d}\) that projects the generic 20-dimensional input vector \(\mathbf{x}_{i,t}\) at time t onto a d-dimensional vector \(\mathbf{z}_{i,t} = m(\mathbf{x}_{i,t}, t)\) (assuming \(d \ll 20\)) accounting for such time-dynamics. Both d and \(\mathbf{z}_{i,t}\) depend on the dimensionality and nature of the hidden layer in the RNN (the “bottleneck” typical of autoencoders) used for realizing the encoding. Since \(\mathbf{z}_{i,t}\) is continuous-valued, it cannot be fed into the D-HRF directly. A discretization is performed first. Assuming the usual logistic sigmoids \(\sigma (\cdot )\) are used in the encoding hidden layer of the RNN, having the interval (0, 1) as their codomain, we quantized the output \(z_{k,t} = \sigma _k(\cdot )\) of the generic k-th hidden unit at time t onto q intervals \(I_1, \ldots , I_q\) such that \(I_\ell = (a_\ell , b_\ell )\) for \(\ell = 1, \ldots , q\), \(a_1 = 0.0\), \(b_q = 1.0\), and \(a_{\ell +1} = b_\ell \) if \(\ell < q\). A discrete alphabet of q symbols \(\{ \mathcal{S}_1, \ldots , \mathcal{S}_q \}\) is uniquely associated with the intervals and used to represent the discretized value \(disc(z_{k,t})\) of \(z_{k,t}\) as \(disc(z_{k,t}) = \mathcal{S}_s\) iff \(z_{k,t} \in I_s\). Note that the construction of the intervals \(I_1, \ldots , I_q\) shall account for the fact that the logistic sigmoid entails a non-uniform distribution of its values, and that even small differences in close-to-zero outputs of the RNN activation functions may make a difference.
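The quantization \(disc(\cdot )\) amounts to a simple interval lookup, as in the following sketch (Python/NumPy); the factory function and its arguments are our own illustrative choices, and the interval endpoints are left to the caller precisely because non-uniform intervals may be preferable, as noted above.

```python
import numpy as np

def make_disc(boundaries, symbols):
    """Returns disc(.), mapping z in (0, 1) to the symbol S_s of the interval I_s containing z.

    boundaries -- interval endpoints 0.0 = a_1 < b_1 = a_2 < ... < b_q = 1.0 (q + 1 values)
    symbols    -- the q symbols S_1, ..., S_q
    """
    assert len(symbols) == len(boundaries) - 1
    def disc(z: float) -> str:
        s = np.searchsorted(boundaries, z, side="right") - 1   # index of the interval I_s
        return symbols[min(max(s, 0), len(symbols) - 1)]
    return disc
```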

At this point, each input protein is encoded as a sequence of d-tuples of symbols of the alphabet \(\{ \mathcal{S}_1, \ldots , \mathcal{S}_q \}\), which may be thought of as the time-specific outcomes of a set of d observable random variables \(X_{1}, \ldots , X_{d}\) and, in turn, modeled via a D-HRF. Bearing in mind the notation introduced in the previous section, the training sequence \(O = O_1, O_2,...,O_T\) for the D-HRF is then defined by letting \(O_t = (disc(z_{1,t}), \ldots , disc(z_{d,t}))\) for \(t=1, \dots , T\).

The prediction of the disulfide bonding state of the cysteines can be stated formally in terms of a 2-class classification problem (namely, bonding/non-bonding). The latter is faced within the framework of Bayes decision theory, by assigning each cysteine to the class predicted according to the maximum class-posterior probability. The posterior probabilities are estimated relying on Bayes’ theorem, where the prior probabilities (estimated from the relative frequencies of the two classes in the training set) are combined with the class-conditional probabilities estimated via D-HRF (two class-specific D-HRFs are trained independently of each other from the training sub-sequences of the corresponding class).
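A minimal sketch of this maximum-a-posteriori decision rule follows; the two log-likelihood callables stand in for the class-specific D-HRFs (their internals are not reproduced here), and the function name is hypothetical.

```python
import math

def predict_bonding_state(o_seq, loglik_bond, loglik_free, prior_bond, prior_free):
    """Assigns a cysteine to the class with maximum posterior probability (Bayes' theorem).

    loglik_bond / loglik_free -- log P(O | class), e.g. the (pseudo-)log-likelihoods
                                 returned by the two class-specific D-HRFs
    prior_bond / prior_free   -- class priors estimated from training-set frequencies
    """
    # The evidence P(O) is common to both classes, so comparing prior-weighted
    # class-conditional scores is equivalent to comparing posteriors.
    score_bond = loglik_bond(o_seq) + math.log(prior_bond)
    score_free = loglik_free(o_seq) + math.log(prior_free)
    return "bonding" if score_bond >= score_free else "non-bonding"
```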

4 Results

Experiments were carried out using proteins from the crystallographic database Protein Data Bank (PDB) [3], possibly the most popular benchmark for the prediction of disulfide bonds between cysteines [17]. As in [15], a subset of the May 2010 release of PDB (available publicly at http://www.biocomp.unibo.it/savojard/PDBCYS.ssbonds.txt) was used. It consists of 1797 sequences (Eukaryotic proteins) having an average length of 273 residues and containing a variable number of cysteines, ranging from 2 to 50 (roughly 6 cysteines per protein on average), for an overall number of 10813 cysteines (3194 bonding, 7619 non-bonding). A 10-fold cross-validation evaluation strategy was applied. For each individual fold, \(80\,\%\) of the cysteines were used for training, \(10\,\%\) for validation (i.e., model selection), and the remaining \(10\,\%\) for test.

Application of the proposed RNN/D-HRF hybrid occurred according to the following steps and architecture/hyper-parameter selections. First, windows of 31 contiguous residues were extracted from the overall sequences for all cysteines, each window being centered at the location of the corresponding cysteine (the length of the windows was selected by cross-validation during a preliminary trial). If the number of residues preceding (or following) a certain cysteine was not enough to cover the whole window length, an adequate number of additional null items was used to pad the initial (or final, respectively) portion of the window up to the required length. The residues in the windows were encoded via RNN into sequences of 3 discrete random variables to be modeled via D-HRF, letting \(\rho = 0.9\) and using \(q=12\) discrete symbols represented by the capital letters from “A” to “L” (the possible outcomes of the random variables), as explained in the previous section. The quantization of the real-valued outputs of the corresponding 3-unit hidden layer of the RNN was defined as \(disc(z) = A\) if \(z < 10^{-3}\), \(disc(z) = B\) if \(z \in (10^{-3}, 10^{-2})\), \(disc(z) = C\) if \(z \in (10^{-2}, 10^{-1})\), and finally \(disc(z) = D, \ldots , L\) at regular intervals between 0.1 and 1.0. The RNN had logistic sigmoids with adaptive bias in the hidden units, and linear output units. A single RNN was used (regardless of the bonding state of individual cysteines) and trained for 1000 epochs of BPTT with a (non-adaptive) learning rate \(\eta = 0.001\) and no momentum, starting from a uniformly random initialization of the forward connection weights (and of the biases of the sigmoids) over the \((-10^{-3},10^{-3})\) interval, and an initial weight equal to 0.9 for the lateral and self-recurrent connections.
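The windowing step can be illustrated by the following sketch (Python); the padding symbol, function name, and list-based representation are our own hypothetical choices, while the window length of 31 follows the setup described above.

```python
def cysteine_windows(residues, window=31, pad="<null>"):
    """Extracts a window of `window` contiguous residues centered on each cysteine,
    padding the initial/final portion with null items when the protein is too short.

    residues -- list of 3-letter abbreviations for one protein ("Cys" marks cysteines)
    """
    half = window // 2                      # 15 residues on each side for window = 31
    windows = []
    for t, a in enumerate(residues):
        if a != "Cys":
            continue
        left = residues[max(0, t - half): t]
        right = residues[t + 1: t + 1 + half]
        left = [pad] * (half - len(left)) + left      # pad the initial portion
        right = right + [pad] * (half - len(right))   # pad the final portion
        windows.append(left + [a] + right)
    return windows
```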

As for the D-HRF, two class-specific models were created, having 3 latent variables each (performance could not be improved when resorting to longer left-to-right Markov chains, while the complexity of the resulting machine increased). A Viterbi segmentation [14] of the input sequences was used for the initialization, where the MBs for the BNs in the state-specific HRFs (found via the chi-square test of correlation among the variables) turned out to be the complete cliques over the 3 encoding variables \(X_1, X_2\), and \(X_3\) at any given location within the window. This initialization was followed by an application of the parameter and structure learning algorithms. Further re-iterations could increase the pseudo-likelihood at the expense of the accuracy, due to the non-discriminative nature of the maximum likelihood criterion and its likely tendency to over-explain (that is, overfit) the individual training data.
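As an illustration of the initialization step, the following sketch shows one plausible way of running pairwise chi-square independence tests over the discretized encoding variables (here via SciPy's chi2_contingency); the significance level and the exact procedure used in the experiments are not specified in this paper, so the sketch should be read only as an assumption-laden example of how the complete cliques over \(X_1, X_2, X_3\) could emerge.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi_square_relatives(samples, alpha=0.05):
    """Pairwise chi-square independence tests over discrete variables.

    samples -- (N, d) array of discrete symbols (here d = 3 encoding variables);
               variable j is added to the relatives of variable i whenever the
               test rejects independence at level alpha.
    """
    d = samples.shape[1]
    relatives = {i: set() for i in range(d)}
    for i in range(d):
        for j in range(i + 1, d):
            xi, xj = samples[:, i], samples[:, j]
            # contingency table of joint symbol counts for (X_i, X_j)
            table = np.array([[np.sum((xi == a) & (xj == b))
                               for b in np.unique(xj)] for a in np.unique(xi)])
            _, p, _, _ = chi2_contingency(table)
            if p < alpha:
                relatives[i].add(j)
                relatives[j].add(i)
    return relatives
```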

The results (in terms of averages over the folds ± standard deviation, when available) are reported in Table 1, which includes comparisons with respect to major established prediction techniques that were evaluated on the PDB. For each technique the table reports, in order, the model (acronyms are explained below), the specific input attributes (i.e., the information fed to the machine), the version of PDB used, the corresponding dataset size (expressed as \(n_p(n_c)\) where \(n_p\) is the number of proteins and \(n_c\) is the number of cysteines), the bibliographic source, and the accuracy. The attributes are grouped into three major types, namely RC (the bare Residue Chain, like in the present study), PSSM (the real-valued Position Specific Score Matrix, obtained by multiple alignment via BLAST and conveying evolutionary information [8]), and AI (Additional Information, like global descriptors [9], subcellular information [15], etc.).

Table 1. Predicting the disulfide bonding state of cysteines from the PDB (Legend for attributes: RC = residue chain, PSSM = position specific score matrix, AI = additional information).

The first two rows report on the highest accuracies yielded by MLPs [8], obtained when using windows of (best) length 13. Based on the gap between the results obtained with (PSSM) or without (RC) evolutionary information, the authors of [8] drew strong conclusions on the relevance of the latter to the end of predicting the bonding state. The gap helps put in perspective the degree of difficulty of the task we are facing (that is, prediction relying on protein-specific sequence-level information only).

The next row reports the result we obtained using a standard discrete HMM trained over windows of (best) length equal to 31. Two class-specific left-to-right HMMs were used, having 5 states each, trained via Baum-Welch on the 20 abbreviations (“Ala”, ..., “Val”) of the different amino-acids. The result may be considered a baseline for the D-HRF as well, since a D-HRF with MBs limited to singletons (i.e., a single random variable \(X_1\) for the HRFs at each time t) reduces to a discrete HMM.

The results yielded by SVMs are reported in the next three rows. The first refers to the highest accuracy reported in [19] using only windows of the RC (best window length: 15). Results with SVMs were improved in [6, 9] by combining SVM-based classifiers trained in two stages over different input features. As for the latter approach, the expression “w/o tuning” refers to the setup with no grammatical post-processing (i.e. “tuning”, as presented in [6]) aimed at preventing higher-level inconsistent predictions (e.g., an uneven number of bonding cysteines within a given protein). The accuracy yielded by the same approach with tuning (penultimate row of the table) is amongst the highest to date. Further improvements to the SVM-based predictor were presented in [5], where the SVM benefits from a bi-directional RNN preprocessing, as well as from a post-processing via finite state automaton (FSA) which, similarly to the aforementioned “tuning” [6], acts as a language model hampering unlikely/impossible sets of predictions within any given protein.

The comparison with the conditional random fields (CRFs) [16] is interesting, since CRFs are, like D-HRFs, probabilistic graphical models. Besides, CRFs are intrinsically discriminative, since they model conditional distributions instead of joint distributions of the random variables. Nevertheless, the RNN/D-HRF approach compares favorably with the results handed out in [16]. Still, the highest accuracy reported (91 %, last row of the table) is yielded by another variant of CRFs, the Grammatical-Restrained Hidden Conditional Random Field [15], basically a CRF followed by an FSA-based post-processing (again, to ensure satisfaction of the aforementioned grammatical constraint). The difference between the accuracies yielded by the two variants of CRFs ([16] versus [15]) underlines the gap in terms of performance due to (1) using PSSM instead of the bare RC, as well as to (2) post-processing via the language model.

The result obtained via the proposed RNN/D-HRF (\(85.02 \pm 0.41\) % accuracy, with average \(82\,\%\) sensitivity and \(86\,\%\) specificity) is substantially aligned with the performance offered by most major established techniques, regardless of the input attributes adopted (and, in spite of the fact that we tested the algorithm on the largest and most complex version of the PDB available to date). In particular, to the best of our knowledge the accuracy yielded by the RNN/D-HRF is the highest to date when relying on the RC only.

5 Conclusions

The paper presented the first application of D-HRFs to real-life data. It was shown that the model is effective, and that its hybridization with the recurrent autoencoder is a promising direction for learning and prediction-making over the primary structure of proteins. Albeit preliminary, the results obtained are (at least) aligned with those offered by major, state-of-the-art approaches, especially in light of the fact that the predictions are based on the sequence of residues only (while most established methods rely on additional attributes, like global descriptors of chemical/physical properties of the proteins, and/or evolutionary information). Future research activities are going to focus on evaluating the approach using PSSM and/or additional attributes, as well as on the investigation of the effects of post-processing the output of the D-HRF according to a language model.