PMirP: A pre-microRNA prediction method based on structure–sequence hybrid features
Introduction
MicroRNAs (miRNAs) are a recently discovered class of endogenous non-coding RNAs, 21–26 nucleotides (nt) long, that suppress the translation of target genes by binding to their mRNAs. According to current studies, a miRNA is originally transcribed as a long primary miRNA (pri-miRNA) by RNA polymerase II [1]. The pri-miRNA is then processed into a 60–70 nt miRNA precursor (pre-miRNA) by the nuclear RNase III enzyme Drosha [2]. The pre-miRNA is transported from the nucleus to the cytoplasm by exportin-5 [3] and finally cleaved into a mature miRNA.
In recent years, more than 4000 miRNAs in various species have been confirmed. Increasing evidence has demonstrated that miRNAs play important roles in various biological processes, including cell proliferation, cell death and fat metabolism in Drosophila, the development of nematodes, and the differentiation of hematopoietic stem cells in mammals. Moreover, about one-third of human genes are regulated by miRNAs. miRNA molecules are therefore among the core components of gene regulatory networks.
Increasingly sophisticated computational approaches have been developed for predicting novel pre-miRNAs. Most of these methods rely on comparative genomics, based on the principle that pre-miRNAs are evolutionarily conserved [3]. For instance, the stem-loop hairpin, the characteristic secondary structure of pre-miRNAs, is highly conserved. Researchers have identified many potential miRNA genes [4] by scanning genomic sequences for conserved pre-miRNA stem-loop structures with secondary structure prediction software (such as Vienna [5] and MFOLD [6]). For example, MiRscan has been used to predict hundreds of pre-miRNAs in nematodes and human, with a sensitivity of about 74% [7], [8]. MiRdetector achieved an accuracy of about 90% when tested on rice miRNA genes and confirmed 95 miRNA genes in rice [9]. Based on a set of novel structural features of the stem-loops, Xue et al. developed the Triplet-SVM method, which identifies pre-miRNAs effectively without using comparative genomics information [10]. However, no web server was provided for the method. MiPred improved on the Triplet-SVM method by adding MFE and p-value features and using a random forest (RF) machine-learning algorithm [11]. However, the MiPred web server is time-consuming and can handle at most three sequences at a time.
In this paper, we use a support vector machine (SVM) with hybrid features, namely the left-triplet encoding, the double-helix structure with free nucleotides, the minimum free energy (MFE) of the secondary structure, and base-pairing features, to distinguish real pre-miRNAs from pseudo pre-miRNAs with similar stem-loop structures. The left-triplet combines local contiguous structure with the sequence information of the stem-loops to encode the secondary structure of a pre-miRNA. In addition, the RNA double helix is used for the first time as a feature for identifying pre-miRNAs. Test results show that the proposed method achieves higher identification accuracy than existing methods. We have also developed a web server for our method, which is freely available to the public.
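To make the idea of a structure-sequence triplet concrete, the following minimal sketch counts triplet elements from a sequence and its dot-bracket secondary structure. It follows the general scheme usually described for Triplet-SVM [10]: every ")" is treated as "(" (paired), each window of three structure states is combined with one nucleotide, and the 4 × 8 = 32 counts are normalized. The function name `triplet_features` and the `anchor` parameter are hypothetical. Taking the leftmost nucleotide of the window (`anchor=0`) is only an assumption about what "left-triplet" means; the exact definition used by PMirP is not given in this excerpt.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGU"
STATES = "(."  # "(" = paired, "." = unpaired


def triplet_features(seq, struct, anchor=0):
    """Return normalized frequencies of the 32 structure-sequence triplets.

    seq    -- RNA sequence (T is read as U)
    struct -- dot-bracket structure of the same length
    anchor -- which nucleotide of the 3-position window labels the triplet:
              0 = left (assumed "left-triplet"), 1 = middle
    """
    if len(seq) != len(struct):
        raise ValueError("sequence and structure must have the same length")
    if anchor not in (0, 1, 2):
        raise ValueError("anchor must be 0, 1 or 2")

    seq = seq.upper().replace("T", "U")
    paired = struct.replace(")", "(")  # only paired/unpaired status is kept

    # Fixed key order so every sequence maps to the same 32-dimensional vector.
    keys = [n + "".join(s) for n in NUCLEOTIDES for s in product(STATES, repeat=3)]

    counts = Counter()
    for i in range(len(seq) - 2):
        nt = seq[i + anchor]
        if nt in NUCLEOTIDES:  # windows labelled by ambiguous bases are skipped
            counts[nt + paired[i:i + 3]] += 1

    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keys}


# Toy hairpin; the structure string is hand-written, not predicted.
features = triplet_features("GGGAAACCC", "(((...)))")
```

In practice the dot-bracket string would come from a secondary structure predictor such as the Vienna package [5], and the resulting vector would be concatenated with the other features (MFE, base-pairing counts) before being passed to the classifier.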
Section snippets
Standard SVM
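Let $(x_i, y_i)$, $i = 1,\dots,m$, be the training set, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$. Training an SVM amounts to solving a linearly constrained quadratic programming problem:

$$\min_{\alpha}\ \frac{1}{2}\alpha^{T} Q \alpha - \sum_{i=1}^{m}\alpha_i \quad \text{subject to} \quad \sum_{i=1}^{m} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C,\ i = 1,\dots,m,$$

where the matrix $Q$ is defined by $Q_{ij} = y_i y_j K(x_i, x_j)$, $i, j = 1,\dots,m$, and $C$ is a regularization constant. The separating rule based on the optimal hyperplane is then the indicator function

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\right),$$

where $K$ is a kernel function, $b$ is the bias term, and the $x_i$ with nonzero $\alpha_i$ are the support vectors.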
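As an illustration of the separating rule, the sketch below evaluates the usual kernel decision function, the sign of the sum of α_i y_i K(x_i, x) over the support vectors plus a bias b, for an already trained model. The RBF kernel, the `gamma` value, and the toy support vectors are assumptions chosen for the example; this excerpt does not state which kernel PMirP uses.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; an assumed choice for this sketch."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decide(x, support_vectors, labels, alphas, b=0.0, kernel=rbf_kernel):
    """Classify x as +1 or -1 from the support vectors of a trained SVM."""
    score = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b
    return 1 if score >= 0 else -1


# Toy model with one support vector per class (values are made up).
svs = [(0.0, 0.0), (2.0, 2.0)]
ys = [1, -1]
alphas = [1.0, 1.0]

near_positive = svm_decide((0.1, 0.1), svs, ys, alphas)
near_negative = svm_decide((1.9, 1.9), svs, ys, alphas)
```

Only the support vectors enter the sum, which is why training points with α_i = 0 can be discarded once the quadratic program has been solved.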
In kernel designs, the
Comparison with triplet-SVM method
As shown in Table 1, 68 out of 69 human pre-miRNAs are correctly recognized by our method and 949 out of the 1000 pseudo pre-miRNAs are correctly identified as negative, giving a sensitivity of 98.4% and a specificity of 94.9%, respectively. The major improvement is in the prediction of pseudo pre-miRNAs: the triplet-SVM method of Ref. [10] correctly classified only 881 samples, a specificity of 88.1%. On the set of human pre-miRNA, the precision of our method is superior to that of the
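The percentages above follow from simple ratios of correctly classified samples. A minimal sketch, using only the counts quoted in the text (the helper names `rate` and `summarize` are hypothetical):

```python
def rate(correct, total):
    """Percentage of correctly classified samples."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def summarize(true_pos, n_pos, true_neg, n_neg):
    """Sensitivity, specificity and overall accuracy, all in percent."""
    return {
        "sensitivity": rate(true_pos, n_pos),
        "specificity": rate(true_neg, n_neg),
        "accuracy": rate(true_pos + true_neg, n_pos + n_neg),
    }


# Counts quoted in the text: 68 of 69 real, 949 of 1000 pseudo pre-miRNAs.
ours = summarize(68, 69, 949, 1000)

# Triplet-SVM on the same 1000 pseudo pre-miRNAs, as quoted from Ref. [10].
triplet_svm_specificity = rate(881, 1000)

print({name: round(value, 1) for name, value in ours.items()})
print(round(triplet_svm_specificity, 1))
```

The specificity values reproduce the quoted 94.9% and 88.1%. Computed this way, 68 of 69 comes out slightly above the quoted 98.4%; the difference may come from rounding or from counts in Table 1 that are not reproduced in this excerpt.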
Discussion and conclusions
Building on the triplet-SVM method, we developed a hybrid coding method by introducing the left-triplet, free-nucleotide, MFE and base-pairing features, with an SVM as the classifier. To test the effectiveness of the proposed method, human pre-miRNA sequence data, data sets from 11 other species and the latest pre-miRNA sequences were applied to both our method and the triplet-SVM method. Experimental results show that the proposed method improves the prediction accuracy compared with
Acknowledgements
We wish to thank the Institute of Bioinformatics, University of Georgia, USA for technical assistance. We would also like to thank Dr. Chenghai Xue of Tsinghua University for helpful discussions. We are grateful for the support of the NSFC (10872077, 60903097, 60703025), the National High-technology Development Project of China (2009AA02Z307), and the Science-Technology Development Project from Jilin Province (20080708, 20090152).
References (22)
MicroRNAs: genomics, biogenesis, mechanism, and function. Cell (2004)
MicroRNA: past and present. Front Biosci (2007)
An extensive class of small RNAs in Caenorhabditis elegans. Science (2001)
A computational approach to identify genes for functional RNAs in genomic sequence. Nucleic Acids Res (2001)
Vienna RNA secondary structure server. Nucleic Acids Res (2003)
Prediction of RNA secondary structure by energy minimization. Methods Mol Biol (1994)
The microRNAs of Caenorhabditis elegans. Genes Dev (2003)
Vertebrate microRNA genes. Science (2003)
Computational identification of Drosophila microRNA genes. Genome Biol (2003)
Classification of real and pseudo microRNA precursors using local structure–sequence features and support vector machine. BMC Bioinformatics (2005)
MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res