Abstract
Due to the high number of genomic sequencing projects, the number of RNA transcripts has increased significantly, creating a huge volume of data. Thus, new computational methods are needed to analyze these data and extract information from them. In particular, when parts of a genome are transcribed into RNA molecules, specific classes of RNA with different functions are produced, such as mRNA and ncRNA. Among these, long non-coding RNAs (lncRNAs) have emerged as key regulators of many biological processes, and machine learning approaches are being used to identify this enigmatic RNA class. Considering this, we present a Fourier transform-based feature extraction approach with 5 numerical mapping techniques (Voss, Integer, Real, EIIP and Z-curve), in order to classify lncRNAs from plants. We investigate four classification algorithms: Naive Bayes, Random Forest, Support Vector Machine and AdaBoost. Moreover, the proposed approach was compared with 4 competing methods available in the literature (CPC2, CNCI, PLEK, and RNAplonc). The experimental results demonstrated high efficiency for the classification of lncRNAs, providing competitive performance.
All authors thank the Federal University of Technology - Paraná (UTFPR), CNPq, Fundação Araucária/SETI and CAPES for supporting this study.
1 Introduction
In recent years, the number of RNA transcripts has grown greatly due to thousands of sequencing projects [13, 21], creating a huge volume of data for analysis. Hence, advances in complete transcriptome sequencing have posed new challenges for discovering novel functional transcriptional elements [24], for instance, the Long Non-Coding RNAs (lncRNAs). The lncRNAs are a new type of Non-Coding RNA (ncRNA) with a length greater than 200 nucleotides [14]. According to recent studies, they play essential roles in several critical biological processes [32], such as transcriptional regulation and immune response. The discovery of lncRNAs as essential gene regulators in many biological contexts has motivated the development of new machine learning approaches to extract relevant information from lncRNAs. In this context, several methods have been successfully applied: CPC [12], CPAT [26], CNCI [23], PLEK [13], LncRNApred [20], RNAplonc [19], BASiNET [10], and LncFinder [9].
The CPC measures the protein-coding potential of a transcript based on two feature categories: the extent and quality of the Open Reading Frame (ORF), and features derived from a BLASTX [2] search. As a prediction method, the authors used the LIBSVM package to train a Support Vector Machine (SVM) model with the standard radial basis function kernel. CPAT classifies transcripts as coding or non-coding using Logistic Regression (LR). This model uses four sequence features: ORF coverage, ORF size, hexamer usage bias, and Fickett TESTCODE statistic. CNCI was modeled with SVM and uses profiling of Adjoining Nucleotide Triplets (ANT - 64*64) and the most-like CDS (MLCDS).
In contrast, PLEK (2014) is based on the k-mer scheme (\(k = 1-5\)) to predict lncRNA, also applying the SVM classifier. LncRNApred classifies lncRNAs with Random Forest (RF) and features based on ORF, signal to noise ratio, k-mer (\(k = 1-3\)), sequence length, and GC content. RNAplonc considers 16 features (ORF, GC content, k-mer scheme \((k = 1-6)\), sequence length) and classifies sequences with the REPtree algorithm. BASiNET classifies sequences based on feature extraction from complex network measurements. Finally, LncFinder evaluates five classifiers (LR, SVM, RF, Extreme Learning Machine, and Deep Learning) and applies the one that obtains the highest accuracy. Moreover, the authors use features of ORF, secondary structure, and electron-ion interaction.
Some of these works [9, 20] have explored Genomic Signal Processing (GSP) techniques. According to Abo-Zahhad et al. [1], GSP is defined as the analysis of genomic signals, whose purpose is to obtain and translate biological knowledge into systems-based applications. To use GSP techniques, it is necessary to apply a numeric representation for the transformation or mapping of genomic data (represented in DNA by the letters A (adenine), T (thymine), G (guanine) and C (cytosine)) [17]. In the literature, distinct DNA Numerical Representation (DNR) techniques have been developed [1]. According to Mendizabal-Ruiz et al. [17], these representations can be divided into three categories: single-value mapping (e.g., integer representation [8, 17], real number representation [5], Electron-Ion Interaction Pseudopotential (EIIP) [18]), multidimensional sequence mapping (e.g., Voss representation [25]), and cumulative sequence mapping (e.g., Z-curve representation [31]).
As previously shown, some works used this approach in lncRNA classification: Pian et al. [20] applied the Voss representation and Han et al. [9] the EIIP representation. Nevertheless, the authors used this approach in conjunction with other feature extraction techniques, and without testing other numerical mappings. Furthermore, according to Abo-Zahhad et al. [1], the Voss representation (one of the most applied methods) may be redundant. Therefore, considering that it is not yet clear what the properties of each DNR are and how the selection of these distinct techniques can affect the results in a signal processing approach [17], we elaborated a study with 5 numerical mapping techniques (Voss, Integer, Real, EIIP, Z-curve), in order to classify lncRNAs.
2 Materials and Methods
This section describes the methodological procedures used to achieve the proposed objectives. Fundamentally, we divided our approach into five stages: (1) Data selection and preprocessing; (2) Feature extraction; (3) Training; (4) Test; (5) Performance analysis.
2.1 Data Selection
Sequences of a plant species (Arabidopsis thaliana), obtained from CPC2 [11], were adopted in order to validate the proposed method. Following the literature methods, this work also adopts two classes for the datasets: a positive class, with lncRNAs, and a negative class, with protein-coding genes (mRNAs). The mRNA data were obtained from the RefSeq database with protein sequences annotated by Swiss-Prot [11], and the lncRNA data from the Ensembl (v87) and Ensembl Plants (v32) databases. We used only sequences longer than 200 nt [13], and we also removed sequence redundancy (identity \(\ge 90\%\)) using the CD-HIT-EST tool (v4.6.1) [15].
2.2 Feature Extraction
At this stage, a Fourier transform-based feature extraction approach is applied to the input samples (to detect lncRNAs and mRNAs). To this end, we adopt five representations: Voss [25], Integer [8, 17], Real [5], Z-curve [31], and EIIP [18]. Fundamentally, we denote a biological sequence \(S = (S[0], S[1], \ldots , S[N-1])\) such that \(S \in \{A, C, G, T \}^N\).
Fourier Transform: To generate features based on a Fourier approach, we apply the Discrete Fourier Transform (DFT), widely used in digital image processing and digital signal processing, which can reveal hidden periodicities after transforming time-domain data to the frequency domain [27]. According to Yin and Yau [28], the DFT of a signal with length N, x[n] (\(n = 0, 1, \ldots , N - 1\)), at frequency k, can be defined by Eq. (1):
\[X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j \frac{2\pi }{N} k n}, \qquad k = 0, 1, \ldots , N-1 \tag{1}\]
This method is extensively studied in bioinformatics, mainly for analysis of periodicities and repetitive elements in DNA sequences [3] and protein structures [16].
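To make Eq. (1) concrete, the following minimal Python sketch evaluates the DFT directly from its definition and computes the squared magnitude of each coefficient. It is an illustration only and not the authors' implementation: the function names `dft` and `power_spectrum` and the toy signal are assumptions introduced here, and a real pipeline would normally rely on a fast Fourier transform routine instead of this quadratic-time loop.

```python
import cmath


def dft(x):
    """Discrete Fourier transform evaluated directly from the definition (O(N^2))."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def power_spectrum(x):
    """Squared magnitude of every DFT coefficient."""
    return [abs(coefficient) ** 2 for coefficient in dft(x)]


# Hypothetical toy signal of length N = 8 that repeats every 4 samples.
signal = [1, 0, -1, 0, 1, 0, -1, 0]
ps = power_spectrum(signal)

# The periodicity appears as energy at k = N / 4 = 2 and at its mirror k = 6.
for k, value in enumerate(ps):
    print(k, round(value, 6))
```

In this toy case all the energy sits at k = 2 and k = 6, which is the sense in which the DFT reveals a hidden periodicity.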
Voss Representation: This representation can use single or multidimensional vectors. Fundamentally, this approach transforms a sequence \(S \in \{A, \, C, \, G, \, T\}^N\) into a matrix \(\mathbf {V} \in \{0,1\}^{4 \times N}\) such that \(\mathbf {V} = [\mathbf {v}_1,\,\mathbf {v}_2,\,\mathbf {v}_3,\,\mathbf {v}_4]^T\), where T is the transpose operator and each \(\mathbf {v}_i\) array is constructed according to the following relation, with \((b_1, b_2, b_3, b_4) = (A, C, G, T)\) and \(n = 0, 1, \ldots , N-1\):
\[\mathbf {v}_i[n] = \begin{cases} 1, & S[n] = b_i \\ 0, & \text{otherwise} \end{cases} \tag{2}\]
As a result, each row of matrix \(\mathbf {V}\) may be seen as an array that marks each base position, such that the first row denotes the presence of base A, the second row base C, the third row base G, and the last row base T. For example, let \(S = (G, A, G, A, G, T, G, A, C, C, A)\) be a sequence to be represented using the Voss representation. Then \(\mathbf {v}_1 = (0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1)\) represents the locations of bases A, \(\mathbf {v}_2 = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0)\) those of bases C, \(\mathbf {v}_3 = (1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0)\) those of bases G, and \(\mathbf {v}_4 = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)\) those of bases T. Then, applying the DFT to the indicator sequences shown above, we obtain (see Eq. 3):
\[V_i[k] = \sum_{n=0}^{N-1} \mathbf {v}_i[n]\, e^{-j \frac{2\pi }{N} k n}, \qquad i = 1, \ldots , 4, \quad k = 0, 1, \ldots , N-1 \tag{3}\]
The power spectrum of a biological sequence can be obtained by Eq. (4), where \(V_i[k]\) is the DFT of the indicator sequence \(\mathbf {v}_i\):
\[PS_V[k] = \sum_{i=1}^{4} \left| V_i[k] \right| ^2, \qquad k = 0, 1, \ldots , N-1 \tag{4}\]
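The Voss mapping and its power spectrum can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the names `voss`, `dft` and `voss_power_spectrum` are hypothetical, the row order (A, C, G, T) follows the description above, and the power spectrum is taken as the sum of the squared DFT magnitudes of the four indicator sequences.

```python
import cmath

BASES = "ACGT"  # row order described in the text: A, C, G, T


def voss(seq):
    """Return the four binary indicator sequences (rows of V) for seq."""
    return [[1 if symbol == base else 0 for symbol in seq] for base in BASES]


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def voss_power_spectrum(seq):
    """Sum of |DFT|^2 over the four indicator sequences, per frequency k."""
    spectra = [dft(row) for row in voss(seq)]
    return [sum(abs(spec[k]) ** 2 for spec in spectra) for k in range(len(seq))]


seq = "GAGAGTGACCA"  # worked example from the text
v = voss(seq)
ps = voss_power_spectrum(seq)

for base, row in zip(BASES, v):
    print(base, row)
```

A quick sanity check on such a sketch is the zero-frequency term, which equals the sum of the squared base counts (here 4 A, 2 C, 4 G and 1 T).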
Integer Representation: This representation is one-dimensional [8, 17]. This mapping can be obtained by substituting the four nucleotides (T, C, A, G) of a biological sequence with the integers (0, 1, 2, 3), respectively, e.g., let \(S = (G, A, G, A, G, T, G, A, C, C, A)\), thus, d = (3, 2, 3, 2, 3, 0, 3, 2, 1, 1, 2), as given in Eq. (5). The DFT and power spectrum are given in Eq. (6).
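As a small illustration, the integer mapping can be written as a lookup table. The assignment T = 0, C = 1, A = 2, G = 3 used below is the one implied by the worked example d; the names `INTEGER_MAP` and `integer_mapping` are assumptions made for this sketch.

```python
# Integer codes implied by the worked example d in the text.
INTEGER_MAP = {"T": 0, "C": 1, "A": 2, "G": 3}


def integer_mapping(seq):
    """Replace each nucleotide by its integer code (KeyError on other symbols)."""
    return [INTEGER_MAP[base] for base in seq.upper()]


d = integer_mapping("GAGAGTGACCA")
print(d)  # [3, 2, 3, 2, 3, 0, 3, 2, 1, 1, 2]
```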
Real Representation: In this representation, Chakravarthy et al. [5] use a real mapping based on the complement property of the complex mapping of [3]. This mapping assigns negative decimal values to the purines (A, G) and positive decimal values to the pyrimidines (C, T), e.g., let \(S = (G, A, G, A, G, T, G, A, C, C, A)\), thus, r = (−0.5, −1.5, −0.5, −1.5, −0.5, 1.5, −0.5, −1.5, 0.5, 0.5, −1.5), as given in Eqs. (7) and (8).
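The complement property mentioned above can be checked with a short sketch. The values A = −1.5, G = −0.5, C = 0.5, T = 1.5 are read off the worked example r; the helper names `real_mapping` and `complement` are hypothetical. With these values, mapping the complementary strand simply negates the signal.

```python
# Real values read off the worked example r in the text.
REAL_MAP = {"A": -1.5, "C": 0.5, "G": -0.5, "T": 1.5}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def real_mapping(seq):
    """Replace each nucleotide by its real-valued code."""
    return [REAL_MAP[base] for base in seq.upper()]


def complement(seq):
    """Base-wise complement of a DNA sequence (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


seq = "GAGAGTGACCA"
r = real_mapping(seq)
r_comp = real_mapping(complement(seq))

print(r)
print(r_comp == [-value for value in r])  # True: complement negates the signal
```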
Z-Curve Representation: The Z-curve scheme is a three-dimensional curve presented by [31] to encode DNA sequences with more biological semantics. Essentially, we can inspect a given sequence S[n] of length N, taking into account the n-th element of the sequence (\(n = 1, 2, \ldots , N\)). Then, we denote the cumulative occurrence numbers \(A_n\), \(C_n\), \(G_n\) and \(T_n\) for each base A, C, G and T, as the number of times that a base occurred from S[1] up until S[n]. Therefore:
\[A_n + C_n + G_n + T_n = n\]
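Where the Z-curve consists of a series of nodes \(P_1, P_2, \ldots , P_{N}\), whose coordinates x[n], y[n], and z[n] \((n = 1, 2, \ldots , N)\) are uniquely determined by the Z-transform, shown in Eq. (9):
\[\begin{cases} x[n] = (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n \\ y[n] = (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n \\ z[n] = (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n \end{cases} \tag{9}\]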
Where R, Y, M, K, W and S denote the bases of purine (\(R = A, G\)), pyrimidine (\(Y = C, T\)), amino (\(M = A, C\)), keto (\(K = G, T\)), weak hydrogen bonds (\(W = A, T\)) and strong hydrogen bonds (\(S = G, C\)), respectively [22, 30]. The coordinates x[n], y[n], and z[n] represent three independent distributions that completely describe a sequence [1]. Therefore, we will have three distributions with definite biological significance: (1) \(x[n] = \) purine/pyrimidine, (2) \(y[n] = \) amino/keto, (3) \(z[n] = \) weak hydrogen bonds/strong hydrogen bonds [31], e.g., let \(S = (G, A, G, A, G, T, G, A, C, C, A)\), thus, \(x = (1, 2, 3, 4, 5, 4, 5, 6, 5, 4, 5)\); \(y = (-1, 0, -1, 0, -1, -2, -3, -2, -1, 0, 1)\); \(z = (-1, 0, -1, 0, -1, 0, -1, 0, -1, -2, -1)\). Essentially, the difference between each dimension at the n-th position and the previous (\(n-1\)) position can be either 1 or \(-1\) [31]. Finally, the DFT of the Z-curve representation is obtained by applying the DFT of Eq. (1) to each of the coordinate sequences x[n], y[n] and z[n], and the power spectrum may be defined as [30]:
\[PS_Z[k] = \left| X[k] \right| ^2 + \left| Y[k] \right| ^2 + \left| Z[k] \right| ^2, \qquad k = 0, 1, \ldots , N-1\]
where X[k], Y[k] and Z[k] denote the DFTs of x[n], y[n] and z[n], respectively.
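The cumulative construction of the Z-curve can be sketched as three running sums. The sketch below is illustrative only: the function name `z_curve` is an assumption, and the sign conventions (purine, amino and weak hydrogen bond bases count +1, the others −1) are the ones that reproduce the worked example given above.

```python
def z_curve(seq):
    """Return the three Z-curve coordinate sequences x, y, z of a DNA sequence."""
    x, y, z = [], [], []
    cx = cy = cz = 0
    for base in seq.upper():
        if base not in "ACGT":
            raise ValueError(f"unexpected symbol: {base!r}")
        cx += 1 if base in "AG" else -1  # purine (+1) vs pyrimidine (-1)
        cy += 1 if base in "AC" else -1  # amino (+1) vs keto (-1)
        cz += 1 if base in "AT" else -1  # weak (+1) vs strong (-1) hydrogen bonds
        x.append(cx)
        y.append(cy)
        z.append(cz)
    return x, y, z


x, y, z = z_curve("GAGAGTGACCA")
print(x)  # [1, 2, 3, 4, 5, 4, 5, 6, 5, 4, 5]
print(y)  # [-1, 0, -1, 0, -1, -2, -3, -2, -1, 0, 1]
print(z)  # [-1, 0, -1, 0, -1, 0, -1, 0, -1, -2, -1]
```

Because each step adds either 1 or −1 to every coordinate, consecutive values of x, y and z always differ by exactly one unit, as stated in the text.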
EIIP Representation: Nair and Sreenadhan [18] proposed EIIP values of nucleotides to represent biological sequences and to locate exons. According to the authors, a numerical sequence representing the distribution of free electron energies can be called an “EIIP indicator sequence”, e.g., let \(S = (G, A, G, A, G, T, G, A, C, C, A)\), thus, b = (0.0806, 0.1260, 0.0806, 0.1260, 0.0806, 0.1335, 0.0806, 0.1260, 0.1340, 0.1340, 0.1260), as shown in Eq. (13). The DFT and power spectrum of this representation are presented in Eq. (14).
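The following sketch builds the EIIP indicator sequence and looks for the dominant non-zero frequency of its power spectrum. The EIIP values are those appearing in the worked example; the function names and the toy repeat "ATG" × 12 are hypothetical and serve only to show how a strict 3-base repeat produces a peak at k = N/3.

```python
import cmath

# EIIP values as they appear in the worked example b in the text.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_mapping(seq):
    """Replace each nucleotide by its EIIP value."""
    return [EIIP[base] for base in seq.upper()]


def power_spectrum(x):
    """|DFT|^2 of a numeric sequence, computed naively in O(N^2)."""
    n_samples = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                for n in range(n_samples))) ** 2
        for k in range(n_samples)
    ]


b = eiip_mapping("GAGAGTGACCA")  # worked example from the text

# Hypothetical toy sequence with an exact 3-base repeat, N = 36.
repeat = "ATG" * 12
ps = power_spectrum(eiip_mapping(repeat))

# Strongest component among k = 1 .. N/2 (k = 0 is only the mean level).
peak = max(range(1, len(repeat) // 2 + 1), key=lambda k: ps[k])
print(peak, len(repeat) // 3)
```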
Features: Finally, we apply the feature extraction to all representations, adopting the Signal to Noise Ratio (SNR) [22], average power spectrum, median, maximum, minimum, sample standard deviation, population standard deviation, percentiles (15/25/50/75), amplitude, and variance. The SNR exploits the statistical phenomenon known as period-3 behavior or 3-base periodicity [29].
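A possible way to turn a power spectrum into the feature vector listed above is sketched below with the Python standard library. Several details are assumptions, since the text does not spell them out: the SNR is taken as the power at k = N/3 divided by the average power over all N coefficients, the amplitude is taken as maximum minus minimum, and the percentiles use inclusive linear interpolation. The function names and the toy sequence are hypothetical.

```python
import cmath
import statistics


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def voss_power_spectrum(seq):
    """Sum of |DFT|^2 over the four Voss indicator sequences."""
    rows = [[1 if symbol == base else 0 for symbol in seq] for base in "ACGT"]
    spectra = [dft(row) for row in rows]
    return [sum(abs(spec[k]) ** 2 for spec in spectra) for k in range(len(seq))]


def spectrum_features(ps):
    """Summary statistics of a power spectrum (definitions are assumptions)."""
    cuts = statistics.quantiles(ps, n=100, method="inclusive")  # 99 cut points
    average = statistics.mean(ps)
    return {
        "snr": ps[len(ps) // 3] / average,  # assumed: peak at N/3 over mean power
        "average_power": average,
        "median": statistics.median(ps),
        "maximum": max(ps),
        "minimum": min(ps),
        "sample_std": statistics.stdev(ps),
        "population_std": statistics.pstdev(ps),
        "percentile_15": cuts[14],
        "percentile_25": cuts[24],
        "percentile_50": cuts[49],
        "percentile_75": cuts[74],
        "amplitude": max(ps) - min(ps),  # assumed: range of the spectrum
        "variance": statistics.variance(ps),
    }


# Hypothetical toy sequence with an exact 3-base repeat, N = 12.
features = spectrum_features(voss_power_spectrum("ATG" * 4))
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

For this toy repeat the power is concentrated at k = 0, N/3 and 2N/3, so the assumed SNR evaluates to 4; sequences without 3-base periodicity would be expected to give values closer to 1.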
2.3 Normalization, Sampling, Training and Evaluation Metrics
We adopt the min-max normalization method, which rescales each feature to the range 0 to 1 (or -1 to 1, if there are negative values), before the classification step. Moreover, a sampling method was applied to our dataset, since it is imbalanced (A. thaliana: 2,540 lncRNA/13,973 mRNA). Thus, we applied SMOTE [6], an over-sampling approach that adjusts the class distribution by creating “synthetic” examples of the minority class. Next, we investigate four classification algorithms: Naive Bayes (NB), Random Forest (RF), SVM and AdaBoost. To induce our models, we used \(70\%\) of samples for training (with 10-fold cross-validation) and \(30\%\) for testing. Finally, the representations were evaluated with sensitivity (SE - correctly predicted lncRNAs), specificity (SPC - correctly classified mRNAs), accuracy (ACC), and Cohen’s kappa coefficient [7].
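The normalization and the four evaluation metrics can be illustrated with a short sketch. It covers only the 0 to 1 variant of min-max scaling and computes the metrics from a binary confusion matrix in which lncRNA is the positive class; the function names and the confusion-matrix counts are hypothetical and are not results from this study.

```python
def min_max(column):
    """Rescale one feature column to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


def evaluate(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Cohen's kappa from a confusion matrix."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)  # correctly predicted lncRNAs
    specificity = tn / (tn + fp)  # correctly classified mRNAs
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"SE": sensitivity, "SPC": specificity, "ACC": accuracy, "kappa": kappa}


print(min_max([2, 4, 6]))

# Hypothetical counts, for illustration only.
metrics = evaluate(tp=90, fn=10, tn=80, fp=20)
print(metrics)
```

Kappa discounts the agreement that would be expected by chance, which is why it is a useful complement to accuracy when the classes are imbalanced.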
3 Results and Discussion
First, we induced our models with the NB, RF, SVM, and AdaBoost classifiers on the training set. Then, to estimate the accuracy on this set, we used 10-fold cross-validation, as shown in Table 1. Evaluating each classifier individually, we observed that the best performance was obtained by Random Forest with Z-curve (0.9605), followed by AdaBoost (EIIP - 0.9521), SVM (EIIP - 0.9476), and NB (Real - 0.9300). After training, the predictive models induced by NB, RF, SVM, and AdaBoost were applied to the test set. Fig. 1 summarizes, in a polar chart, the SE, SPC, kappa and ACC metrics for each representation.
As can be seen in Fig. 1, the RF classifier maintained the best performance on the test set using Z-curve (ACC = 0.9553), followed by AdaBoost (ACC = 0.9526) adopting EIIP. In general, the best results are obtained with the Real, Z-curve and EIIP representations. However, if we use the AdaBoost classifier as an example, the greatest difference in accuracy between the mappings is approximately 0.0072. Despite this, we noted that the mappings show higher peaks of ACC and SPC with NB. We also evaluated the performance of our best predictive model (RF with Z-curve) against four other state-of-the-art tools: CPC2 [11] (an updated version of the CPC method), CNCI, PLEK and RNAplonc (specifically for plants), as shown in Table 2.
CPC2 (0.9574) reported a performance similar to that of our predictive model (0.9553), followed by RNAplonc (0.9443), CNCI (0.8997), and PLEK (0.6649). Nevertheless, it is important to emphasize that CPC2 and RNAplonc use the ORF descriptor, a widely employed feature for discovering coding sequences which, according to Baek et al. [4], is an essential guideline for distinguishing lncRNAs from mRNA. Considering this, we hypothesize that our approach has an advantage in terms of generalization to distinguish other classes of ncRNA, since this would not be possible with the ORF alone. To evaluate this hypothesis, we ran a second experiment with CPC2 using a new dataset with only non-coding sequences (lncRNA and small ncRNA of A. thaliana, also obtained from [11]), without mRNA sequences. To this end, we used the features provided by CPC2 to construct a model with procedures similar to those of our approach. However, we eliminated the sequence length descriptor provided by CPC2 and also any attribute that would generate this information in our approach, since any explicit bias toward this feature may facilitate the prediction of these sequences. Therefore, we applied new experiments according to the same methodology described in this work (70% training and 30% test) and using the RF classifier, as shown in Table 3.
The tests again support the hypothesis that the proposed method is efficient: we reached an ACC of 0.9595 against 0.8071 for the features provided by CPC2 (e.g., ORF). That is, our approach is robust in terms of generalization to distinguish lncRNA from mRNA, as well as from other classes of ncRNA.
4 Conclusion
In this work, we investigated five numerical mapping techniques (Voss, Integer, Real, EIIP, Z-curve) with the Fourier transform, for feature extraction and classification of lncRNAs (see Note 1). Sequences of a plant species (A. thaliana), obtained from [11], were adopted in order to validate the proposed method. As a result, we conclude that the RF and AdaBoost classifiers presented the best performance using the Z-curve and EIIP representations, respectively. Furthermore, to validate our study, we also compared it with other methods available in the literature (CPC2, CNCI, PLEK, and RNAplonc). The proposed approach presented suitable results, being superior or competitive to other methods, and robust in terms of generalization. Finally, as future work we will analyze these representations more deeply, in order to propose a new numerical mapping with nucleotide triplets and amino acid features (e.g., molar mass, acidity, Van der Waals volume), and we will consider more RNA classes and different organisms for the feature extraction analysis.
Notes
1. Data and materials: github.com/Bonidia/FourierFeatureExtraction.
References
1. Abo-Zahhad, M., Ahmed, S.M., Abd-Elrahman, S.A.: Genomic analysis and classification of exon and intron sequences using DNA numerical mapping techniques. Int. J. Inf. Technol. Comput. Sci. 4(8), 22–36 (2012)
2. Altschul, S.F., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
3. Anastassiou, D.: Genomic signal processing. IEEE Sig. Proc. Mag. 18(4), 8–20 (2001)
4. Baek, J., Lee, B., Kwon, S., Yoon, S.: LncRNAnet: long non-coding RNA identification using deep learning. Bioinformatics 1, 9 (2018)
5. Chakravarthy, N., Spanias, A., Iasemidis, L.D., Tsakalis, K.: Autoregressive modeling and feature analysis of DNA sequences. EURASIP J. Appl. Sig. Process. 2004, 13–28 (2004)
6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
7. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960)
8. Cristea, P.D.: Conversion of nucleotides sequences into genomic signals. J. Cell. Mol. Med. 6(2), 279–303 (2002)
9. Han, S., et al.: LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief. Bioinform. 19, 1–19 (2018). https://doi.org/10.1093/bib/bby065
10. Ito, E.A., Katahira, I., da Vicente, F.F.R., Pereira, L.F.P., Lopes, F.M.: BASiNET-biological sequences network: a case study on coding and non-coding RNAs identification. Nucleic Acids Res. 46, e96 (2018)
11. Kang, Y.J., et al.: CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 45(W1), W12–W16 (2017)
12. Kong, L., et al.: CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 35(suppl–2), W345–W349 (2007)
13. Li, A., Zhang, J., Zhou, Z.: PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinform. 15(1), 311 (2014)
14. Li, A., Zang, Q., Sun, D., Wang, M.: A text feature-based approach for literature mining of lncRNA-protein interactions. Neurocomputing 206, 73–80 (2016)
15. Li, W., Godzik, A.: CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
16. Marsella, L., Sirocco, F., Trovato, A., Seno, F., Tosatto, S.C.: REPETITA: detection and discrimination of the periodicity of protein solenoid repeats by discrete Fourier transform. Bioinformatics 25(12), i289–i295 (2009)
17. Mendizabal-Ruiz, G., Román-Godínez, I., Torres-Ramos, S., Salido-Ruiz, R.A., Morales, J.A.: On DNA numerical representations for genomic similarity computation. PloS One 12(3), e0173288 (2017)
18. Nair, A.S., Sreenadhan, S.P.: A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 1(6), 197 (2006)
19. da Negri, T.C., Alves, W.A.L., Bugatti, P.H., Saito, P.T.M., Domingues, D.S., Paschoal, A.R.: Pattern recognition analysis on long noncoding RNAs: a tool for prediction in plants. Brief. Bioinform. 20, 682–689 (2018)
20. Pian, C., et al.: LncRNApred: classification of long non-coding RNAs and protein-coding transcripts by the ensemble algorithm with a new hybrid feature. PloS One 11(5), e0154567 (2016)
21. Schneider, H.W., Raiol, T., Brigido, M.M., Walter, M.E.M., Stadler, P.F.: A support vector machine based method to distinguish long non-coding RNAs from protein coding transcripts. BMC Genomics 18(1), 804 (2017)
22. Shao, J., Yan, X., Shao, S.: SNR of DNA sequences mapped by general affine transformations of the indicator sequences. J. Math. Biol. 67(2), 433–451 (2013)
23. Sun, L., et al.: Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Res. 41(17), e166–e166 (2013)
24. Ventola, G.M., Noviello, T.M., D’Aniello, S., Spagnuolo, A., Ceccarelli, M., Cerulo, L.: Identification of long non-coding transcripts with feature selection: a comparative study. BMC Bioinform. 18(1), 187 (2017)
25. Voss, R.F.: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68(25), 3805 (1992)
26. Wang, L., Park, H.J., Dasari, S., Wang, S., Kocher, J.P., Li, W.: CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Res. 41(6), e74 (2013)
27. Yin, C., Chen, Y., Yau, S.S.T.: A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering. J. Theor. Biol. 359, 18–28 (2014)
28. Yin, C., Yau, S.S.T.: A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. J. Comput. Biol. 12(9), 1153–1165 (2005)
29. Yin, C., Yau, S.S.T.: Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J. Theor. Biol. 247(4), 687–694 (2007)
30. Zhang, C.T.: A symmetrical theory of DNA sequences and its applications. J. Theor. Biol. 187(3), 297–306 (1997)
31. Zhang, R., Zhang, C.T.: Z curves, an intutive tool for visualizing and analyzing the DNA sequences. J. Biomol. Struct. Dyn. 11(4), 767–782 (1994)
32. Zhang, W., Qu, Q., Zhang, Y., Wang, W.: The linear neighborhood propagation method for predicting long non-coding RNA-protein interactions. Neurocomputing 273, 526–534 (2018)
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Bonidia, R.P., Sampaio, L.D.H., Lopes, F.M., Sanches, D.S. (2019). Feature Extraction of Long Non-coding RNAs: A Fourier and Numerical Mapping Approach. In: Nyström, I., Hernández Heredia, Y., Milián Núñez, V. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2019. Lecture Notes in Computer Science(), vol 11896. Springer, Cham. https://doi.org/10.1007/978-3-030-33904-3_44
DOI: https://doi.org/10.1007/978-3-030-33904-3_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33903-6
Online ISBN: 978-3-030-33904-3
eBook Packages: Computer Science (R0)