Feature Extraction of Long Non-coding RNAs: A Fourier and Numerical Mapping Approach

Bonidia, Robson Parmezan; Sampaio, Lucas Dias Hiera; Lopes, Fabrício Martins; Sanches, Danilo Sipoli

doi:10.1007/978-3-030-33904-3_44

Robson Parmezan Bonidia¹¹,
Lucas Dias Hiera Sampaio¹¹,
Fabrício Martins Lopes¹¹ &
…
Danilo Sipoli Sanches¹¹

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11896))

Included in the following conference series:

Iberoamerican Congress on Pattern Recognition

1636 Accesses
4 Citations

Abstract

Due to the high number of genomic sequencing projects, the number of RNA transcripts increased significantly, creating a huge volume of data. Thus, new computational methods are needed for the analysis and information extraction from these data. In particular, when parts of a genome are transcribed into RNA molecules, some specific classes of RNA are produced, such as mRNA and ncRNA with different functions. In this way, long non-coding RNAs have emerged as key regulators of many biological processes. Therefore, machine learning approaches are being used to identify this enigmatic RNA class. Considering this, we present a Fourier transform-based features extraction approach with 5 numerical mapping techniques (Voss, Integer, Real, EIIP and Z-curve), in order to classify lncRNAs from plants. We investigate four classification algorithms like Naive Bayes, Random Forest, Support Vector Machine and AdaBoost. Moreover, the proposed approach was compared with 4 competing methods available in the literature (CPC2, CNCI, PLEK, and RNAplonc). The experimental results demonstrated high efficiency for the classification of lncRNAs, providing competitive performance.

All authors thank the Federal University of Technology - Paraná (UTFPR), CNPq, Fundação Araucária/SETI and CAPES for supporting this study.

You have full access to this open access chapter, Download conference paper PDF

LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants

Article 26 February 2021

Prediction of LncRNA by Using Muitiple Feature Information Fusion and Feature Selection Technique

Machine Learning-Based Annotation of Long Noncoding RNAs Using PLncPRO

Keywords

1 Introduction

In recent years, the number of RNA transcripts has grown greatly due to thousands of sequencing projects [13, 21], creating a huge volume of data for analysis. Hence, advances in complete transcriptome sequencing have offered new challenges for discovering novel functional transcriptional elements [24], for instance, the Long Non-Coding RNAs (lncRNAs). The lncRNAs are a new type of Non-Coding RNA (ncRNA) with a length greater than 200 nucleotides [14]. According to recent studies, play essential roles in several critical biological processes [32], such as transcriptional regulation and immune response. The discovery of lncRNAs as essential gene regulators in many biological contexts have motivated the development of new machine learning approaches in order to extract relevant information from lncRNAs. In this context, some methods were successfully applied: CPC [12], CPAT [26], CNCI [23], PLEK [13], LncRNApred [20], RNAplonc [19], BASiNET [10], and LncFinder [9].

The CPC measures the protein-coding potential of a transcript based on two feature categories. The extent and quality of the Open Reading Frame (ORF), and derivation of BLASTX [2] search. As a prediction method, the authors used the LIBSVM package to train a Support Vector Machine (SVM) model, using the standard radial basis function kernel. CPAT classifies transcripts of coding and non-coding using Logistic Regression (LR). This model uses four sequence features: ORF coverage, ORF size, hexamer usage bias, and Fickett TESTCODE statistic. CNCI was modeled with SVM and uses profiling Adjoining Nucleotide Triplets (ANT - 64*64) and most-like CDS (MLCDS).

In contrast, PLEK (2014) is based on the k-mer scheme ($k = 1-5$) to predict lncRNA, also applying the SVM classifier. LncRNApred classified lncRNAs with Random Forest (RF) and features based on ORF, signal to noise ratio, k-mer ($k = 1-3$), sequence length, and GC content. RNAplonc considered 16 features (ORF, GC content, K-mer scheme $(k = 1-6)$, sequence length), besides classifying sequences with the REPtree algorithm. BASiNET classifies sequences based on the feature extraction from complex network measurements. Finally, LncFinder uses five classifiers (LR, SVM, RF, Extreme Learning Machine, and Deep Learning), to apply the algorithm that obtains the highest accuracy. Moreover, the authors use features of ORF, secondary structural, and electron-ion interaction.

Some of these works [9, 20] have explored Genomic Signal Processing (GSP) techniques, which according to Abo-Zahhad et al. [1], is defined as the analysis of genomic signals, whose purpose is to obtain and translate biological knowledge into systems-based applications. To use GPS techniques it is necessary to apply a numeric representation for transformation or mapping of genomic data (represented in DNA by the letters A (adenine), T (thymine), G (guanine) and C (cytosine)) [17]. In literature, distinct DNA Numerical Representation (DNR) techniques have been developed [1]. According to Mendizabal-Ruiz et al. [17], these representations can be divided into three categories: single-value mapping (e.g., integer representation [8, 17], real number representation [5], real number representation, Electron-Ion Interaction Pseudopotential (EIIP) [18]), multidimensional sequence mapping (e.g., Voss representation [25]), and cumulative sequence mapping (e.g., Z-curve representation [31]).

As previously shown, some works used this approach in the lncRNAs classification, Pian et al. [20] applied Voss representation and Han et al. [9] EIIP representation. Nevertheless, the authors used these approach in conjunction with other features extraction techniques, and without testing other numerical mappings. Furthermore, according to Abo-Zahhad et al. [1], the Voss representation (one of the most applied methods) may be redundant. Therefore, considering that it is not yet clear what the properties of each DNR and how the selection of these distinct techniques can affect the results in a signal processing approach [17], we elaborated a study with 5 numerical mapping techniques (Voss, Integer, Real, EIIP, Z-curve), in order to classify lncRNAs.

2 Materials and Methods

This section describes the methodological procedures used to achieve the proposed objectives. Fundamentally, we divided our approach into five stages: (1) Data selection and preprocessing; (2) Feature extraction; (3) Training; (4) Test; (5) Performance analysis.

2.1 Data Selection

Sequences of plant species (Arabidopsis thaliana), obtained from CPC2 [11], were adopted in order to validate the proposed method. Following the literature methods, this work also adopts two classes for the datasets: positive class, with lncRNAs, and negative class, with protein-coding genes (mRNAs). The mRNA data were obtained from the RefSeq database with protein sequences annotated by Swiss-Prot [11], and lncRNA data from the Ensembl (v87) and Ensembl Plants (v32) database. We used only sequences longer than 200nt [13], and we also removed sequence redundancy (identity $\ge 90\%$), using CD-HIT-EST tool (v4.6.1) [15].

2.2 Feature Extraction

At this stage, a Fourier transform based features extraction approach is performed on input samples (to detect lncRNAs and mRNAs). Thus, we adopt five representations: Voss [25], Integer [8, 17], Real [5], Z-curve [31], and EIIP [18]. Fundamentally, we denote a biological sequence $S = (S[0], S[1], \ldots , S[N-1])$ such that $S \in \{A, C, G, T \}^N$.

Fourier Transform: To generate features based in a Fourier approach, we apply the Discrete Fourier Transform (DFT), widely used for digital image processing and digital signal processing, that can reveal hidden periodicities after transformation of time domain data to frequency domain space [27]. According to Yin and Yau [28], the DFT of a signal with length N, x[n] ($n = 0, 1, \ldots , N - 1$), at frequency k, can be defined by Eq. (1):

$$\begin{aligned} X[k] = \sum _{n=0}^{N - 1} x[n] \, e^{-j \frac{2\pi }{N} kn}, \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$

(1)

This method is extensively studied in bioinformatics, mainly for analysis of periodicities and repetitive elements in DNA sequences [3] and protein structures [16].

Voss Representation: This representation can use single or multidimensional vectors. Fundamentally, this approach transforms a sequence $S \in \{A, \, C, \, G, \, T\}^N$ into a matrix $\mathbf {V} \in \{0,1\}^{4 \times N}$ such that $\mathbf {V} = [\mathbf {v}_1,\,\mathbf {v}_2,\,\mathbf {v}_3,\,\mathbf {v}_4]^T$, where T is the transpose operator and each $\mathbf {v}_i$ array is constructed according to the following relation:

$$\begin{aligned} v_{i}[n] = \left\{ \begin{matrix} 1, &{} S[n] = \alpha [i] \\ 0, &{} S[n] \ne \alpha [i] \end{matrix}, \text { where } \alpha = (A,\,C,\,G,\,T), \qquad n = 0, 1, \ldots , N - 1. \right. \end{aligned}$$

(2)

As a result, each row of matrix $\mathbf {V}$ may be seen as an array that marks each base position such that the first row denotes the presence of base A, row two for base C, row three base G and the last row for base T. For example, let $S = (G, A, G, A, G, T, G, A, C, C, A)$ be a sequence that needs to be represented using Voss representation, therefore, $\mathbf {v}_1 = (0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1)$, which represents the locations of bases A, $\mathbf {v}_2 = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0)$ for bases C, $\mathbf {v}_3 = (1, 0, 1, 0, 1, 0, 1,$ 0, 0, 0, 0) for the G bases, $\mathbf {v}_4 = (0, 0, 0, 0, 0, 1, 0, $ 0, 0, 0, 0) for T bases. Then, using the DFT in the indicator sequences shown above, we obtain (see Eq. 3):

$$\begin{aligned} V_i[k] = \sum _{n=0}^{N - 1} v_{i}[n] e^{-j \frac{2\pi }{N} kn},\ \forall \ i \in [1,4], \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$

(3)

The power spectrum of a biological sequence can be obtained by Eq. (4):

$$\begin{aligned} P_V[k] = \sum \limits _{i=1}^4 \, \left| V_i[k]\right| ^2, \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$

(4)

Integer Representation: This representation is one-dimensional [8, 17]. This mapping can be obtained by substituting the four nucleotides (G, A, C, T) of a biological sequence for integers (0, 1, 2, 3), respectively, e.g., let $S = (G, A, G, A, G, T, G, A, C, C, A)$, thus, d = (3, 2, 3, 2, 3, 0, 3, 2, 1, 1, 2), as exposed in Eq. (5). The DFT and power spectrum are exposed in Eq. (6).

$$\begin{aligned} d[n] = \left\{ \begin{matrix} 3, &{} S[n] = G\\ 2, &{} S[n] = A\\ 1, &{} S[n] = C\\ 0, &{} S[n] = T\\ \end{matrix}, \right. \qquad n = 0, 1,\ldots , N-1. \end{aligned}$$

(5)

$$\begin{aligned} D[k] = \sum _{n=0}^{N - 1} d[n] e^{-j \frac{2\pi }{N} kn}, \qquad P_D[k] = |D[k]|^2 , \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$

(6)

Real Representation: In this representation, Chakravarthy et al. [5] use real mapping based on the complement property of the complex mapping of [3]. This mapping applies positive decimal values for the purines (A, G), and negative decimal values for the pyrimidines (C, T), e.g., let $S = (G, A, G, A, G, T, G, A, C, C, A)$, thus, r = (−0.5, −1.5, −0.5, −1.5, −0.5, 1.5, −0.5, −1.5, 0.5, 0.5, −1.5), as Eqs. (7) and (8).

$$\begin{aligned} r[n] = \left\{ \begin{matrix} -0.5, &{} S[n] = G\\ -1.5, &{} S[n] = A\\ 0.5, &{} S[n] = C\\ 1.5, &{} S[n] = T\\ \end{matrix}, \right. \qquad n = 0, 1,\ldots , N-1. \end{aligned}$$

(7)

$$\begin{aligned} R[k] = \sum _{n=0}^{N - 1} r[n] e^{-j \frac{2\pi }{N} kn}, \qquad P_R[k] = |R[k]|^2, \qquad k = 0, 1,\ldots , N-1. \end{aligned}$$

(8)

Z-Curve Representation: The Z-curve scheme is a three-dimensional curve presented by [31], to encode DNA sequences with more biological semantics. Essentially, we can inspect a given sequence S[n] of length N, taking into account the n-th element of the sequence ($n = 1, 2, \ldots , N$). Then, we denote the cumulative occurrence numbers $A_n$, $C_n$, $G_n$ and $T_n$ for each base A, C, G and T, as the number of times that a base occurred from S[1] up until S[n]. Therefore:

$$\begin{aligned} A_n + C_n + G_n + T_n = n \end{aligned}$$

(9)

Where the Z-curve consists of a series of nodes $P_1, P_2, \ldots , P_{N}$, whose coordinates x[n], y[n], and z[n] $(n = 1, 2, \ldots , N)$ are uniquely determined by the Z-transform, shown in Eq. (9):

$$\begin{aligned} \begin{aligned} P[n] = \left\{ \begin{matrix} x[n] = (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n\\ y[n] = (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n\\ z[n] = (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n\end{matrix},\right. \quad \\ x[n], y[n], z[n] \in [-n, n], \qquad n = 1, 2, \ldots , N. \end{aligned} \end{aligned}$$

(10)

Where R, Y, M, K, W and S denote the bases of purine ($R = A, G$), pyrimidine ($Y = C, T$), amino ($M = A, C$), keto ($K = G, T$), weak hydrogen bonds ($W = A, T$) and strong hydrogen bonds ($S = G, C$), respectively [22, 30]. The coordinates x[n], y[n], and z[n] represent three independent distributions that completely describe a sequence [1]. Therefore, we will have three distributions with definite biological significance: (1) $x[n] = $ purine/pyrimidine, (2) $y[n] = $ amino/keto, (3) $z[n] = $ strong hydrogen bonds/weak hydrogen bonds [31], e.g., let $S = (G, A, G, A, G, T, G, A, C, C, A)$, thus, $x = (1, 2, 3, 4, 5, 4, 5, 6, 5, 4, 5)$; $y = (-1, 0, -1, 0, -1, -2, -3, -2, -1, 0, 1)$; $z = (-1, 0, -1, 0, -1, 0, -1, 0, -1, -2, -1)$. Essentially, the difference between each dimension at the n-th position and the previous ($n-1$) position can be either 1 or $-1$ [31]. Finally, the DFT and the power spectrum of the Z-Curve representation may be defined as [30]:

$$\begin{aligned} X[k] = \sum _{n=1}^{N} x[n] e^{-j \frac{2\pi }{N} kn}, \quad Y[k] = \sum _{n=1}^{N} y[n] e^{-j \frac{2\pi }{N} kn},\quad Z[k] = \sum _{n=1}^{N} z[n] e^{-j \frac{2\pi }{N} kn}.\quad \end{aligned}$$

(11)

$$\begin{aligned} P_C[k] = |X[k]|^2 + |Y[k]|^2 + |Z[k]|^2, \qquad k = 1, 2,\ldots , N. \end{aligned}$$

(12)

EIIP Representation: Nair and Sreenadhan [18] proposed EIIP values of nucleotides to represent biological sequences and to locate exons. According to the authors, a numerical sequence representing the distribution of free electron energies can be called “EIIP indicator sequence", e.g., let $S =$ (G, A, G, A, G, T, G, A, C, C, A), thus, b = (0.0806, 0.1260, 0.0806, 0.1260, 0.0806, 0.1335, 0.0806, 0.1260, 0.1340, 0.1340, 0.1260), as shown in Eq. (13). The DFT and power spectrum of this representation are presented in Eq. (14).

$$\begin{aligned} b[n] = \left\{ \begin{matrix} 0.0806, &{} S[n] = G\\ 0.1260, &{} S[n] = A\\ 0.1340, &{} S[n] = C\\ 0.1335, &{} S[n] = T\\ \end{matrix}, \right. \qquad n = 0, 1,\ldots , N-1. \end{aligned}$$

(13)

$$\begin{aligned} { E[k] = \sum _{n=0}^{N - 1} b[n] e^{-j \frac{2\pi }{N} kn}, \qquad P_E[k] = |E[k]|^2 , \qquad k = 0, 1,\ldots , N-1. } \end{aligned}$$

(14)

Features: Finally, we apply the feature extraction in all representations, adopting Signal to Noise Ratio (SNR - [22]), average power spectrum, median, maximum, minimum, sample standard deviation, population standard deviation, percentile (15/25/50/75), amplitude, and variance. The SNR uses the statistical phenomenon known as period-3 behavior or 3-base periodicity [29].

2.3 Normalization, Sampling, Training and Evaluation Metrics

We adopt the min-max normalization method, which fits the data range to 0 and 1 (or -1 to 1, if there are negative values) for each feature, in order to use them on classification step. Moreover, the sampling method was adopted in our dataset, since we are faced with the imbalanced data problem (A. thaliana (2,540 lncRNA/13,973 mRNA)). Thus, we applied SMOTE [6], an over-sampling approach (to adjust the class distribution), in which “synthetic" examples are created, over-sampling the minority class. Next, we investigate four classification algorithms, like Naive Bayes (NB), Random Forest (RF), SVM and AdaBoost. To induce our models, we used $70\%$ of samples for training (with 10-fold cross-validation) and $30\%$ for testing. Finally, the representations were evaluated with sensitivity (SE - correctly predicted lncRNAs), specificity (SPC - correctly classified mRNAs), accuracy (ACC), and Cohen’s kappa coefficient [7].

3 Results and Discussion

First, we induced our models with the NB, RF, SVM, and AdaBoost classifiers in the training set. Then, to estimate the real accuracy of this set, we used 10-fold cross-validation, as exposed in Table 1. Evaluating each classifier individually, we observed that the best performance was of the Random Forest with Z-curve (0.9605), followed by AdaBoost (EIIP - 0.9521), SVM (EIIP - 0.9476), and NB (Real - 0.9300). After training, the predictive models induced by NB, RF, SVM, and AdaBoost were applied to the test set, in which Fig. 1 summarizes in a polar chart, the SE, SPC, kappa and ACC metrics for each representation.

Table 1. Real accuracy estimates for the training set using 10-fold cross-validation.

Full size table

As can be seen, in Fig. 1, the RF classifier maintained the best performance in the test set using Z-curve (ACC = 0.9553), followed by AdaBoost (ACC = 0.9526) adopting EIIP. In general, the best results are contained in the Real, Z-curve and EIIP representation. However, if we use the AdaBoost classifier as an example, the greatest difference in accuracy between the mappings is approximately 0.0072. Although this, we noted that the mappings have higher peaks of ACC, SPC in NB. We also evaluated the performance of our best predictive model (RF with Z-curve) against other four state-of-the-art tools; CPC2 [11] (an updated version of the CPC method), CNCI, PLEK and RNAplonc (specifically for plants), as shown in Table 2.

Table 2. Comparative performance between our approach and state-of-the-art tools.

Full size table

Table 3. Comparative between CPC2 and our approach to a new dataset with only non-coding sequences (lncRNA and Small ncRNA - A. thaliana).

Full size table

CPC2 (0.9574) reported a similar performance along with our predictive model (0.9553), followed by RNAplonc (0.9443), CNCI (0.8997), and PLEK (0.6649). Nevertheless, it is important to emphasize that CPC2 and RNAplonc use the ORF descriptor, a highly employed feature for discovering coding sequences and which, according to Baek et al. [4] is an essential guideline for distinguishing lncRNAs from mRNA. Considering this, our approach has an advantage in terms of generalization to distinguish other classes of ncRNA, since this would not be possible only with the ORF. To evaluate this hypothesis, we apply a second experiment with CPC2 using a new dataset with only non-coding sequences (lncRNA and Small ncRNA - A. thaliana - also obtained from [11]) without mRNA sequences. For such, we used the features provided by CPC2 to construct a model with similar procedures to our approach. However, we eliminated the sequence length descriptor provided by CPC2 and also any attribute that would generate this information in our approach, since that any explicit bias to this feature may facilitate the prediction of these sequences. Therefore, we applied new experiments according to the same methodology described in this work (70% training and 30% test) and using the RF classifier, as shown in Table 3.

The tests confirm again the hypothesis that the proposed method is efficient, in which we reached an ACC of 0.9595 against 0.8071 of the features provided by CPC2 (e.g., ORF). That is, our approach is robust in terms of generalization to distinguish lncRNA from mRNA, as well as other classes of ncRNA.

4 Conclusion

In this work, we investigated five numerical mapping techniques (Voss, Integer, Real, EIIP, Z-curve) with the Fourier transform, for feature extraction and classification of lncRNAs^{Footnote 1}. Thereby, sequences of plant species (A. thaliana), obtained from [11] were adopted in order to validate the proposed method. As results, we conclude that the RF and AdaBoost classifiers presented the best performance using the Z-curve and EIIP representations, respectively. Furthermore, to validate our study, we also compared with other available methods in the literature (CPC2, CNCI, PLEK, and RNAplonc). The proposed approach presented suitable results, being superior or competitive to other methods, and robust in terms of generalization. Finally, as future works we will analyze these representations more deeply, in order to propose a new numerical mapping with nucleotides triplets and amino acid features, e.g., molar mass, acidity, Van Der Waals volume, to consider more RNA classes and different organisms for the feature extraction analysis.

Notes

1.
Data and materials: github.com/Bonidia/FourierFeatureExtraction.

References

Abo-Zahhad, M., Ahmed, S.M., Abd-Elrahman, S.A.: Genomic analysis and classification of exon and intron sequences using dna numerical mapping techniques. Int. J. Inf. Technol. Comput. Sci. 4(8), 22–36 (2012)
Google Scholar
Altschul, S.F., et al.: Gapped blast and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
Article Google Scholar
Anastassiou, D.: Genomic signal processing. IEEE Sig. Proc. Mag. 18(4), 8–20 (2001)
Article Google Scholar
Baek, J., Lee, B., Kwon, S., Yoon, S.: LncRNAnet: long non-coding RNA identification using deep learning. Bioinformatics 1, 9 (2018)
Google Scholar
Chakravarthy, N., Spanias, A., Iasemidis, L.D., Tsakalis, K.: Autoregressive modeling and feature analysis of DNA sequences. EURASIP J. Appl. Sig. Process. 2004, 13–28 (2004)
MATH Google Scholar
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Article MATH Google Scholar
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960)
Article Google Scholar
Cristea, P.D.: Conversion of nucleotides sequences into genomic signals. J. Cell. Mol. Med. 6(2), 279–303 (2002)
Article MathSciNet Google Scholar
Han, S., et al.: LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information and physicochemical property. Brief. Bioinform. 19, 1–19 (2018). https://doi.org/10.1093/bib/bby065
Article Google Scholar
Ito, E.A., Katahira, I., da Vicente, F.F.R., Pereira, L.F.P., Lopes, F.M.: BASiNET-biological sequences network: a case study on coding and non-coding RNAs identification. Nucleic Acids Res. 46, e96 (2018)
Article Google Scholar
Kang, Y.J., et al.: CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 45(W1), W12–W16 (2017)
Article Google Scholar
Kong, L., et al.: CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res. 35(suppl–2), W345–W349 (2007)
Article Google Scholar
Li, A., Zhang, J., Zhou, Z.: PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinform. 15(1), 311 (2014)
Article Google Scholar
Li, A., Zang, Q., Sun, D., Wang, M.: A text feature-based approach for literature mining of lncrna-protein interactions. Neurocomputing 206, 73–80 (2016)
Article Google Scholar
Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
Article Google Scholar
Marsella, L., Sirocco, F., Trovato, A., Seno, F., Tosatto, S.C.: REPETITA: detection and discrimination of the periodicity of protein solenoid repeats by discrete fourier transform. Bioinformatics 25(12), i289–i295 (2009)
Article Google Scholar
Mendizabal-Ruiz, G., Román-Godínez, I., Torres-Ramos, S., Salido-Ruiz, R.A., Morales, J.A.: On DNA numerical representations for genomic similarity computation. PloS One 12(3), e0173288 (2017)
Article Google Scholar
Nair, A.S., Sreenadhan, S.P.: A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 1(6), 197 (2006)
Google Scholar
da Negri, T.C., Alves, W.A.L., Bugatti, P.H., Saito, P.T.M., Domingues, D.S., Paschoal, A.R.: Pattern recognition analysis on long noncoding RNAs: a tool for prediction in plants. Brief. Bioinform. 20, 682–689 (2018)
Article Google Scholar
Pian, C., et al.: LncRNApred: classification of long non-coding rnas and protein-coding transcripts by the ensemble algorithm with a new hybrid feature. PloS One 11(5), e0154567 (2016)
Article Google Scholar
Schneider, H.W., Raiol, T., Brigido, M.M., Walter, M.E.M., Stadler, P.F.: A support vector machine based method to distinguish long non-coding rnas from protein coding transcripts. BMC Genomics 18(1), 804 (2017)
Article Google Scholar
Shao, J., Yan, X., Shao, S.: SNR of DNA sequences mapped by general affine transformations of the indicator sequences. J. Math. Biol. 67(2), 433–451 (2013)
Article MathSciNet MATH Google Scholar
Sun, L., et al.: Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Res. 41(17), e166–e166 (2013)
Article Google Scholar
Ventola, G.M., Noviello, T.M., D’Aniello, S., Spagnuolo, A., Ceccarelli, M., Cerulo, L.: Identification of long non-coding transcripts with feature selection: a comparative study. BMC Bioinform. 18(1), 187 (2017)
Article Google Scholar
Voss, R.F.: Evolution of long-range fractal correlations and 1/f noise in dna base sequences. Phys. Rev. Lett. 68(25), 3805 (1992)
Article Google Scholar
Wang, L., Park, H.J., Dasari, S., Wang, S., Kocher, J.P., Li, W.: CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Res. 41(6), e74 (2013)
Article Google Scholar
Yin, C., Chen, Y., Yau, S.S.T.: A measure of dna sequence similarity by fourier transform with applications on hierarchical clustering. J. Theor. Biol. 359, 18–28 (2014)
Article MathSciNet MATH Google Scholar
Yin, C., Yau, S.S.T.: A fourier characteristic of coding sequences: origins and a non-fourier approximation. J. Comput. Biol. 12(9), 1153–1165 (2005)
Article Google Scholar
Yin, C., Yau, S.S.T.: Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J. Theor. Biol. 247(4), 687–694 (2007)
Article MathSciNet Google Scholar
Zhang, C.T.: A symmetrical theory of dna sequences and its applications. J. Theor. Biol. 187(3), 297–306 (1997)
Article Google Scholar
Zhang, R., Zhang, C.T.: Z curves, an intutive tool for visualizing and analyzing the dna sequences. J. Biomol. Struct. Dyn 11(4), 767–782 (1994)
Article Google Scholar
Zhang, W., Qu, Q., Zhang, Y., Wang, W.: The linear neighborhood propagation method for predicting long non-coding RNA-protein interactions. Neurocomputing 273, 526–534 (2018)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Bioinformatics Graduate Program (PPGBIOINFO), Federal University of Technology - Paraná, UTFPR, Cornélio Procópio, PR, Brazil
Robson Parmezan Bonidia, Lucas Dias Hiera Sampaio, Fabrício Martins Lopes & Danilo Sipoli Sanches

Authors

Robson Parmezan Bonidia
View author publications
You can also search for this author in PubMed Google Scholar
Lucas Dias Hiera Sampaio
View author publications
You can also search for this author in PubMed Google Scholar
Fabrício Martins Lopes
View author publications
You can also search for this author in PubMed Google Scholar
Danilo Sipoli Sanches
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Robson Parmezan Bonidia or Danilo Sipoli Sanches .

Editor information

Editors and Affiliations

Uppsala University, Uppsala, Sweden
Ingela Nyström
University of Information Science, Havana, Cuba
Yanio Hernández Heredia
University of Information Science, Havana, Cuba
Vladimir Milián Núñez

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bonidia, R.P., Sampaio, L.D.H., Lopes, F.M., Sanches, D.S. (2019). Feature Extraction of Long Non-coding RNAs: A Fourier and Numerical Mapping Approach. In: Nyström, I., Hernández Heredia, Y., Milián Núñez, V. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2019. Lecture Notes in Computer Science(), vol 11896. Springer, Cham. https://doi.org/10.1007/978-3-030-33904-3_44

Download citation

DOI: https://doi.org/10.1007/978-3-030-33904-3_44
Published: 22 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33903-6
Online ISBN: 978-3-030-33904-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Association for Pattern Recognition (opens in a new tab)

Feature Extraction of Long Non-coding RNAs: A Fourier and Numerical Mapping Approach

Abstract

Similar content being viewed by others

LncMachine: a machine learning algorithm for long noncoding RNA annotation in plants

Prediction of LncRNA by Using Muitiple Feature Information Fusion and Feature Selection Technique

Machine Learning-Based Annotation of Long Noncoding RNAs Using PLncPRO

Keywords

1 Introduction