Keywords

1 Introduction

In recent years, the number of RNA transcripts has grown greatly due to thousands of sequencing projects [13, 21], creating a huge volume of data for analysis. Hence, advances in complete transcriptome sequencing have offered new challenges for discovering novel functional transcriptional elements [24], for instance, the Long Non-Coding RNAs (lncRNAs). The lncRNAs are a new type of Non-Coding RNA (ncRNA) with a length greater than 200 nucleotides [14]. According to recent studies, play essential roles in several critical biological processes [32], such as transcriptional regulation and immune response. The discovery of lncRNAs as essential gene regulators in many biological contexts have motivated the development of new machine learning approaches in order to extract relevant information from lncRNAs. In this context, some methods were successfully applied: CPC [12], CPAT [26], CNCI [23], PLEK [13], LncRNApred [20], RNAplonc [19], BASiNET [10], and LncFinder [9].

The CPC measures the protein-coding potential of a transcript based on two feature categories. The extent and quality of the Open Reading Frame (ORF), and derivation of BLASTX [2] search. As a prediction method, the authors used the LIBSVM package to train a Support Vector Machine (SVM) model, using the standard radial basis function kernel. CPAT classifies transcripts of coding and non-coding using Logistic Regression (LR). This model uses four sequence features: ORF coverage, ORF size, hexamer usage bias, and Fickett TESTCODE statistic. CNCI was modeled with SVM and uses profiling Adjoining Nucleotide Triplets (ANT - 64*64) and most-like CDS (MLCDS).

In contrast, PLEK (2014) is based on the k-mer scheme (\(k = 1-5\)) to predict lncRNA, also applying the SVM classifier. LncRNApred classified lncRNAs with Random Forest (RF) and features based on ORF, signal to noise ratio, k-mer (\(k = 1-3\)), sequence length, and GC content. RNAplonc considered 16 features (ORF, GC content, K-mer scheme \((k = 1-6)\), sequence length), besides classifying sequences with the REPtree algorithm. BASiNET classifies sequences based on the feature extraction from complex network measurements. Finally, LncFinder uses five classifiers (LR, SVM, RF, Extreme Learning Machine, and Deep Learning), to apply the algorithm that obtains the highest accuracy. Moreover, the authors use features of ORF, secondary structural, and electron-ion interaction.

Some of these works [9, 20] have explored Genomic Signal Processing (GSP) techniques, which according to Abo-Zahhad et al. [1], is defined as the analysis of genomic signals, whose purpose is to obtain and translate biological knowledge into systems-based applications. To use GPS techniques it is necessary to apply a numeric representation for transformation or mapping of genomic data (represented in DNA by the letters A (adenine), T (thymine), G (guanine) and C (cytosine)) [17]. In literature, distinct DNA Numerical Representation (DNR) techniques have been developed [1]. According to Mendizabal-Ruiz et al. [17], these representations can be divided into three categories: single-value mapping (e.g., integer representation [8, 17], real number representation [5], real number representation, Electron-Ion Interaction Pseudopotential (EIIP) [18]), multidimensional sequence mapping (e.g., Voss representation [25]), and cumulative sequence mapping (e.g., Z-curve representation [31]).

As previously shown, some works used this approach in the lncRNAs classification, Pian et al. [20] applied Voss representation and Han et al. [9] EIIP representation. Nevertheless, the authors used these approach in conjunction with other features extraction techniques, and without testing other numerical mappings. Furthermore, according to Abo-Zahhad et al. [1], the Voss representation (one of the most applied methods) may be redundant. Therefore, considering that it is not yet clear what the properties of each DNR and how the selection of these distinct techniques can affect the results in a signal processing approach [17], we elaborated a study with 5 numerical mapping techniques (Voss, Integer, Real, EIIP, Z-curve), in order to classify lncRNAs.

2 Materials and Methods

This section describes the methodological procedures used to achieve the proposed objectives. Fundamentally, we divided our approach into five stages: (1) Data selection and preprocessing; (2) Feature extraction; (3) Training; (4) Test; (5) Performance analysis.

2.1 Data Selection

Sequences of plant species (Arabidopsis thaliana), obtained from CPC2 [11], were adopted in order to validate the proposed method. Following the literature methods, this work also adopts two classes for the datasets: positive class, with lncRNAs, and negative class, with protein-coding genes (mRNAs). The mRNA data were obtained from the RefSeq database with protein sequences annotated by Swiss-Prot [11], and lncRNA data from the Ensembl (v87) and Ensembl Plants (v32) database. We used only sequences longer than 200nt [13], and we also removed sequence redundancy (identity \(\ge 90\%\)), using CD-HIT-EST tool (v4.6.1) [15].

2.2 Feature Extraction

At this stage, a Fourier transform based features extraction approach is performed on input samples (to detect lncRNAs and mRNAs). Thus, we adopt five representations: Voss [25], Integer [8, 17], Real [5], Z-curve [31], and EIIP [18]. Fundamentally, we denote a biological sequence \(S = (S[0], S[1], \ldots , S[N-1])\) such that \(S \in \{A, C, G, T \}^N\).

Fourier Transform: To generate features based in a Fourier approach, we apply the Discrete Fourier Transform (DFT), widely used for digital image processing and digital signal processing, that can reveal hidden periodicities after transformation of time domain data to frequency domain space [27]. According to Yin and Yau [28], the DFT of a signal with length N, x[n] (\(n = 0, 1, \ldots , N - 1\)), at frequency k, can be defined by Eq. (1):

$$\begin{aligned} X[k] = \sum _{n=0}^{N - 1} x[n] \, e^{-j \frac{2\pi }{N} kn}, \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$
(1)

This method is extensively studied in bioinformatics, mainly for analysis of periodicities and repetitive elements in DNA sequences [3] and protein structures [16].

Voss Representation: This representation can use single or multidimensional vectors. Fundamentally, this approach transforms a sequence \(S \in \{A, \, C, \, G, \, T\}^N\) into a matrix \(\mathbf {V} \in \{0,1\}^{4 \times N}\) such that \(\mathbf {V} = [\mathbf {v}_1,\,\mathbf {v}_2,\,\mathbf {v}_3,\,\mathbf {v}_4]^T\), where T is the transpose operator and each \(\mathbf {v}_i\) array is constructed according to the following relation:

$$\begin{aligned} v_{i}[n] = \left\{ \begin{matrix} 1, &{} S[n] = \alpha [i] \\ 0, &{} S[n] \ne \alpha [i] \end{matrix}, \text { where } \alpha = (A,\,C,\,G,\,T), \qquad n = 0, 1, \ldots , N - 1. \right. \end{aligned}$$
(2)

As a result, each row of matrix \(\mathbf {V}\) may be seen as an array that marks each base position such that the first row denotes the presence of base A, row two for base C, row three base G and the last row for base T. For example, let \(S = (G, A, G, A, G, T, G, A, C, C, A)\) be a sequence that needs to be represented using Voss representation, therefore, \(\mathbf {v}_1 = (0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1)\), which represents the locations of bases A, \(\mathbf {v}_2 = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0)\) for bases C, \(\mathbf {v}_3 = (1, 0, 1, 0, 1, 0, 1,\) 0, 0, 0, 0) for the G bases, \(\mathbf {v}_4 = (0, 0, 0, 0, 0, 1, 0, \) 0, 0, 0, 0) for T bases. Then, using the DFT in the indicator sequences shown above, we obtain (see Eq. 3):

$$\begin{aligned} V_i[k] = \sum _{n=0}^{N - 1} v_{i}[n] e^{-j \frac{2\pi }{N} kn},\ \forall \ i \in [1,4], \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$
(3)

The power spectrum of a biological sequence can be obtained by Eq. (4):

$$\begin{aligned} P_V[k] = \sum \limits _{i=1}^4 \, \left| V_i[k]\right| ^2, \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$
(4)

Integer Representation: This representation is one-dimensional [8, 17]. This mapping can be obtained by substituting the four nucleotides (G, A, C, T) of a biological sequence for integers (0, 1, 2, 3), respectively, e.g., let \(S = (G, A, G, A, G, T, G, A, C, C, A)\), thus, d = (3, 2, 3, 2, 3, 0, 3, 2, 1, 1, 2), as exposed in Eq. (5). The DFT and power spectrum are exposed in Eq. (6).

$$\begin{aligned} d[n] = \left\{ \begin{matrix} 3, &{} S[n] = G\\ 2, &{} S[n] = A\\ 1, &{} S[n] = C\\ 0, &{} S[n] = T\\ \end{matrix}, \right. \qquad n = 0, 1,\ldots , N-1. \end{aligned}$$
(5)
$$\begin{aligned} D[k] = \sum _{n=0}^{N - 1} d[n] e^{-j \frac{2\pi }{N} kn}, \qquad P_D[k] = |D[k]|^2 , \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$
(6)

Real Representation: In this representation, Chakravarthy et al. [5] use real mapping based on the complement property of the complex mapping of [3]. This mapping applies positive decimal values for the purines (AG), and negative decimal values for the pyrimidines (CT), e.g., let \(S = (G, A, G, A, G, T, G, A, C, C, A)\), thus, r = (−0.5, −1.5, −0.5, −1.5, −0.5, 1.5, −0.5, −1.5, 0.5, 0.5, −1.5), as Eqs. (7) and (8).

$$\begin{aligned} r[n] = \left\{ \begin{matrix} -0.5, &{} S[n] = G\\ -1.5, &{} S[n] = A\\ 0.5, &{} S[n] = C\\ 1.5, &{} S[n] = T\\ \end{matrix}, \right. \qquad n = 0, 1,\ldots , N-1. \end{aligned}$$
(7)
$$\begin{aligned} R[k] = \sum _{n=0}^{N - 1} r[n] e^{-j \frac{2\pi }{N} kn}, \qquad P_R[k] = |R[k]|^2, \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$
(8)

Z-Curve Representation: The Z-curve scheme is a three-dimensional curve presented by [31], to encode DNA sequences with more biological semantics. Essentially, we can inspect a given sequence S[n] of length N, taking into account the n-th element of the sequence (\(n = 1, 2, \ldots , N\)). Then, we denote the cumulative occurrence numbers \(A_n\), \(C_n\), \(G_n\) and \(T_n\) for each base A, C, G and T, as the number of times that a base occurred from S[1] up until S[n]. Therefore:

$$\begin{aligned} A_n + C_n + G_n + T_n = n \end{aligned}$$
(9)

Where the Z-curve consists of a series of nodes \(P_1, P_2, \ldots , P_{N}\), whose coordinates x[n], y[n], and z[n] \((n = 1, 2, \ldots , N)\) are uniquely determined by the Z-transform, shown in Eq. (9):

$$\begin{aligned} \begin{aligned} P[n] = \left\{ \begin{matrix} x[n] = (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n\\ y[n] = (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n\\ z[n] = (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n\end{matrix},\right. \quad \\ x[n], y[n], z[n] \in [-n, n], \qquad n = 1, 2, \ldots , N. \end{aligned} \end{aligned}$$
(10)

Where R, Y, M, K, W and S denote the bases of purine (\(R = A, G\)), pyrimidine (\(Y = C, T\)), amino (\(M = A, C\)), keto (\(K = G, T\)), weak hydrogen bonds (\(W = A, T\)) and strong hydrogen bonds (\(S = G, C\)), respectively [22, 30]. The coordinates x[n], y[n], and z[n] represent three independent distributions that completely describe a sequence [1]. Therefore, we will have three distributions with definite biological significance: (1) \(x[n] = \) purine/pyrimidine, (2) \(y[n] = \) amino/keto, (3) \(z[n] = \) strong hydrogen bonds/weak hydrogen bonds [31], e.g., let \(S = (G, A, G, A, G, T, G, A, C, C, A)\), thus, \(x = (1, 2, 3, 4, 5, 4, 5, 6, 5, 4, 5)\); \(y = (-1, 0, -1, 0, -1, -2, -3, -2, -1, 0, 1)\); \(z = (-1, 0, -1, 0, -1, 0, -1, 0, -1, -2, -1)\). Essentially, the difference between each dimension at the n-th position and the previous (\(n-1\)) position can be either 1 or \(-1\) [31]. Finally, the DFT and the power spectrum of the Z-Curve representation may be defined as [30]:

$$\begin{aligned} X[k] = \sum _{n=1}^{N} x[n] e^{-j \frac{2\pi }{N} kn}, \quad Y[k] = \sum _{n=1}^{N} y[n] e^{-j \frac{2\pi }{N} kn},\quad Z[k] = \sum _{n=1}^{N} z[n] e^{-j \frac{2\pi }{N} kn}.\quad \end{aligned}$$
(11)
$$\begin{aligned} P_C[k] = |X[k]|^2 + |Y[k]|^2 + |Z[k]|^2, \qquad k = 1, 2,\ldots , N. \end{aligned}$$
(12)

EIIP Representation: Nair and Sreenadhan [18] proposed EIIP values of nucleotides to represent biological sequences and to locate exons. According to the authors, a numerical sequence representing the distribution of free electron energies can be called “EIIP indicator sequence", e.g., let \(S =\) (G, A, G, A, G, T, G, A, C, C, A), thus, b = (0.0806, 0.1260, 0.0806, 0.1260, 0.0806, 0.1335, 0.0806, 0.1260, 0.1340, 0.1340, 0.1260), as shown in Eq. (13). The DFT and power spectrum of this representation are presented in Eq. (14).

$$\begin{aligned} b[n] = \left\{ \begin{matrix} 0.0806, &{} S[n] = G\\ 0.1260, &{} S[n] = A\\ 0.1340, &{} S[n] = C\\ 0.1335, &{} S[n] = T\\ \end{matrix}, \right. \qquad n = 0, 1,\ldots , N-1. \end{aligned}$$
(13)
$$\begin{aligned} { E[k] = \sum _{n=0}^{N - 1} b[n] e^{-j \frac{2\pi }{N} kn}, \qquad P_E[k] = |E[k]|^2 , \qquad k = 0, 1,\ldots , N-1. } \end{aligned}$$
(14)

Features: Finally, we apply the feature extraction in all representations, adopting Signal to Noise Ratio (SNR - [22]), average power spectrum, median, maximum, minimum, sample standard deviation, population standard deviation, percentile (15/25/50/75), amplitude, and variance. The SNR uses the statistical phenomenon known as period-3 behavior or 3-base periodicity [29].

2.3 Normalization, Sampling, Training and Evaluation Metrics

We adopt the min-max normalization method, which fits the data range to 0 and 1 (or -1 to 1, if there are negative values) for each feature, in order to use them on classification step. Moreover, the sampling method was adopted in our dataset, since we are faced with the imbalanced data problem (A. thaliana (2,540 lncRNA/13,973 mRNA)). Thus, we applied SMOTE [6], an over-sampling approach (to adjust the class distribution), in which “synthetic" examples are created, over-sampling the minority class. Next, we investigate four classification algorithms, like Naive Bayes (NB), Random Forest (RF), SVM and AdaBoost. To induce our models, we used \(70\%\) of samples for training (with 10-fold cross-validation) and \(30\%\) for testing. Finally, the representations were evaluated with sensitivity (SE - correctly predicted lncRNAs), specificity (SPC - correctly classified mRNAs), accuracy (ACC), and Cohen’s kappa coefficient [7].

3 Results and Discussion

First, we induced our models with the NB, RF, SVM, and AdaBoost classifiers in the training set. Then, to estimate the real accuracy of this set, we used 10-fold cross-validation, as exposed in Table 1. Evaluating each classifier individually, we observed that the best performance was of the Random Forest with Z-curve (0.9605), followed by AdaBoost (EIIP - 0.9521), SVM (EIIP - 0.9476), and NB (Real - 0.9300). After training, the predictive models induced by NB, RF, SVM, and AdaBoost were applied to the test set, in which Fig. 1 summarizes in a polar chart, the SE, SPC, kappa and ACC metrics for each representation.

Fig. 1.
figure 1

Polar Chart: This figure compares the sensitivity, specificity, kappa and accuracy metrics for each representation in the test set. In addition, it shows how the classifiers (NB, RF, SVM, and AdaBoost) are behaving in each representation.

Table 1. Real accuracy estimates for the training set using 10-fold cross-validation.

As can be seen, in Fig. 1, the RF classifier maintained the best performance in the test set using Z-curve (ACC = 0.9553), followed by AdaBoost (ACC = 0.9526) adopting EIIP. In general, the best results are contained in the Real, Z-curve and EIIP representation. However, if we use the AdaBoost classifier as an example, the greatest difference in accuracy between the mappings is approximately 0.0072. Although this, we noted that the mappings have higher peaks of ACC, SPC in NB. We also evaluated the performance of our best predictive model (RF with Z-curve) against other four state-of-the-art tools; CPC2 [11] (an updated version of the CPC method), CNCI, PLEK and RNAplonc (specifically for plants), as shown in Table 2.

Table 2. Comparative performance between our approach and state-of-the-art tools.
Table 3. Comparative between CPC2 and our approach to a new dataset with only non-coding sequences (lncRNA and Small ncRNA - A. thaliana).

CPC2 (0.9574) reported a similar performance along with our predictive model (0.9553), followed by RNAplonc (0.9443), CNCI (0.8997), and PLEK (0.6649). Nevertheless, it is important to emphasize that CPC2 and RNAplonc use the ORF descriptor, a highly employed feature for discovering coding sequences and which, according to Baek et al. [4] is an essential guideline for distinguishing lncRNAs from mRNA. Considering this, our approach has an advantage in terms of generalization to distinguish other classes of ncRNA, since this would not be possible only with the ORF. To evaluate this hypothesis, we apply a second experiment with CPC2 using a new dataset with only non-coding sequences (lncRNA and Small ncRNA - A. thaliana - also obtained from [11]) without mRNA sequences. For such, we used the features provided by CPC2 to construct a model with similar procedures to our approach. However, we eliminated the sequence length descriptor provided by CPC2 and also any attribute that would generate this information in our approach, since that any explicit bias to this feature may facilitate the prediction of these sequences. Therefore, we applied new experiments according to the same methodology described in this work (70% training and 30% test) and using the RF classifier, as shown in Table 3.

The tests confirm again the hypothesis that the proposed method is efficient, in which we reached an ACC of 0.9595 against 0.8071 of the features provided by CPC2 (e.g., ORF). That is, our approach is robust in terms of generalization to distinguish lncRNA from mRNA, as well as other classes of ncRNA.

4 Conclusion

In this work, we investigated five numerical mapping techniques (Voss, Integer, Real, EIIP, Z-curve) with the Fourier transform, for feature extraction and classification of lncRNAsFootnote 1. Thereby, sequences of plant species (A. thaliana), obtained from [11] were adopted in order to validate the proposed method. As results, we conclude that the RF and AdaBoost classifiers presented the best performance using the Z-curve and EIIP representations, respectively. Furthermore, to validate our study, we also compared with other available methods in the literature (CPC2, CNCI, PLEK, and RNAplonc). The proposed approach presented suitable results, being superior or competitive to other methods, and robust in terms of generalization. Finally, as future works we will analyze these representations more deeply, in order to propose a new numerical mapping with nucleotides triplets and amino acid features, e.g., molar mass, acidity, Van Der Waals volume, to consider more RNA classes and different organisms for the feature extraction analysis.