PMirP: A pre-microRNA prediction method based on structure–sequence hybrid features

https://doi.org/10.1016/j.artmed.2010.03.004

Abstract

Objective

MicroRNAs are small non-coding RNAs whose precursors usually fold into a stem-loop structure. As an important intermediate in microRNA biogenesis, the pre-microRNA is transported from the nucleus to the cytoplasm by exportin5 and finally cleaved into the mature microRNA. Structure–sequence features and the minimum free energy of the secondary structure have previously been used to predict pre-microRNAs. In this work, the double helix structure with free nucleotides and base-pairing features is additionally used to identify pre-miRNAs for the first time.

Methods

We applied a support vector machine to a novel hybrid coding scheme that combines the left-triplet method, the free nucleotides, the minimum free energy of the secondary structure, and base-pairing features. Data sets of human pre-microRNAs, pre-microRNAs from 11 other species, and the latest pre-microRNA sequences were used for testing.

Results

In this study we developed an improved method for pre-microRNA prediction that combines several kinds of features, together with a web server called PMirP. The prediction sensitivity and specificity for real and pseudo human pre-microRNAs are as high as 98.4% and 94.9%, respectively. The web server is freely available to the public at http://ccst.jlu.edu.cn/ci/bioinformatics/MiRNA (accessed: 26 February 2010).

Conclusions

Experimental results show that the proposed method improves prediction efficiency and accuracy over existing methods. In addition, PMirP has lower computational complexity and a higher-throughput prediction capacity than the MiPred web server.

Introduction

MicroRNAs (miRNAs) are a recently discovered class of endogenous non-coding RNAs, 21–26 nucleotides (nt) long, that suppress translation of target genes by binding to their mRNAs. According to current studies, a miRNA is originally transcribed as a long primary miRNA (pri-miRNA) by RNA polymerase II [1]. The pri-miRNA is then processed into a 60–70 nt miRNA precursor (pre-miRNA) by the nuclear RNase III Drosha [2]. The pre-miRNA is transported from the nucleus to the cytoplasm by exportin5 [3] and finally cleaved into a mature miRNA.

In recent years, more than 4000 miRNAs have been confirmed in various species. Increasing evidence has demonstrated that miRNAs play important roles in various biological processes, including cell proliferation, cell death and fat metabolism in Drosophila, the formation of nematodes, the differentiation of hematopoietic stem cells in mammals, and so on. Moreover, about one-third of human genes are regulated by miRNAs. miRNA molecules are therefore among the core components of gene regulatory networks.

Increasingly sophisticated computational approaches have been developed for predicting novel pre-miRNAs. Most of these methods rely on comparative genomics, based on the principle that pre-miRNAs are evolutionarily conserved [3]. For instance, the stem-loop hairpin, the characteristic secondary structure of pre-miRNAs, is highly conserved. Researchers have identified many potential miRNA genes [4] from conserved pre-miRNA stem-loop structures by scanning genomic sequences with secondary structure prediction software (such as Vienna [5] and MFOLD [6]). For example, MiRscan has been employed to predict hundreds of pre-miRNAs in nematodes and human, with a sensitivity of about 74% [7], [8]. MiRdetector achieved an accuracy of about 90% when tested on rice miRNA genes and confirmed 95 miRNA genes in rice [9]. Based on a set of novel structure features of the stem-loops, Xue et al. developed the Triplet-SVM method, which identifies pre-miRNAs effectively without using comparative genomics information [10]. However, no web server was provided for that method. MiPred improved on the Triplet-SVM method by adding MFE and p-value features and using a random forest (RF) machine-learning algorithm [11]. However, the MiPred web server is time-consuming and can handle at most three sequences at a time.
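Secondary structure predictors such as those mentioned above commonly report a fold in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired (free) nucleotides. The following is a minimal sketch of how such a string could be summarized into pair counts; the function names, the returned fields, and the toy hairpin are illustrative assumptions and are not taken from PMirP or any of the cited tools.

```python
from collections import Counter


def parse_pairs(structure):
    """Return sorted (i, j) index pairs of matching brackets in dot-bracket notation."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def pairing_summary(sequence, structure):
    """Count base pairs by type and free nucleotides (hypothetical feature set)."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper().replace("T", "U")
    pairs = parse_pairs(structure)
    # Sort the two bases so that A-U and U-A are counted as the same pair type.
    kinds = Counter("".join(sorted((seq[i], seq[j]))) for i, j in pairs)
    return {
        "pairs": len(pairs),
        "free_nucleotides": structure.count("."),
        "AU": kinds["AU"],
        "CG": kinds["CG"],
        "GU": kinds["GU"],
    }


# Toy hairpin (made up for illustration): three pairs closing a three-base loop.
summary = pairing_summary("GGGAAAUCC", "(((...)))")
```

In practice the structure string would come from a folding program rather than being written by hand, and the counts would typically be normalized by sequence length before being used as features.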

In this paper, we use a support vector machine (SVM) that incorporates hybrid features, namely the left-triplet method, the double helix structure with free nucleotides, the minimum free energy (MFE) of the secondary structure, and base-pairing features, to differentiate real pre-miRNAs from pseudo pre-miRNAs with similar stem-loop structures. The left-triplet combines the local continuous structure with the sequence information of the stem-loops to encode the secondary structure of a pre-miRNA. In addition, the RNA double helix is used for the first time as an important feature for identifying pre-miRNAs. The test results show that the proposed method achieves higher identification accuracy than existing methods. We also provide a web server for our method, which is freely available to the public.
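To make the idea of structure–sequence triplets concrete, the sketch below encodes a folded sequence as the 32 triplet elements used by Triplet-SVM [10]: one nucleotide (A, C, G or U) combined with the paired/unpaired state of three adjacent positions, with "(" and ")" both treated as paired. The exact definition of the left-triplet is not given in this excerpt, so the `anchor` parameter is an assumption: `anchor=1` labels each window by its middle nucleotide as in Triplet-SVM, and `anchor=0` labels it by the left-most nucleotide, which is only our guess at what "left-triplet" means. The sliding window over the whole sequence is likewise a simplification.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def triplet_features(sequence, structure, anchor=0):
    """Normalized frequencies of the 32 (nucleotide + 3-position structure) elements.

    anchor selects which position of each 3-wide window supplies the nucleotide:
    1 = middle (Triplet-SVM), 0 = left-most (ASSUMED reading of "left-triplet").
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    if anchor not in (0, 1, 2):
        raise ValueError("anchor must be 0, 1 or 2")
    seq = sequence.upper().replace("T", "U")
    # Both bracket directions simply mean "paired".
    state = structure.replace(")", "(")
    keys = [n + "".join(s) for n in NUCLEOTIDES for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for start in range(len(seq) - 2):
        key = seq[start + anchor] + state[start:start + 3]
        if key in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[key] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy hairpin (made up for illustration).
left = triplet_features("GGGAAAUCC", "(((...)))", anchor=0)
middle = triplet_features("GGGAAAUCC", "(((...)))", anchor=1)
```

The resulting 32-dimensional vector could then be concatenated with other features, such as MFE and base-pairing counts, before being passed to a classifier.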

Standard SVM

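Let $(x_i, y_i)$, $i = 1,\dots,m$, be the training set, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$. Training an SVM amounts to solving a linearly constrained quadratic programming problem:

$$\min_{\alpha} W(\alpha) = \frac{1}{2}\alpha^{T} Q \alpha - e^{T}\alpha, \quad \text{s.t. } \alpha^{T} y = 0,\ 0 \le \alpha \le C,$$

where the matrix $Q$ is defined by $Q_{ij} = y_i y_j K(x_i, x_j)$, $i, j = 1,\dots,m$, $e$ is the vector of all ones, and $C$ is a regularization constant. The separating rule based on the optimal hyperplane is then the indicator function

$$f(x) = \operatorname{sgn}\Big(\sum_{i} \alpha_i y_i K(x, x_i) + b\Big),$$

where $K$ is a kernel function and the $x_i$ with nonzero $\alpha_i$ are the support vectors.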
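The indicator function above can be evaluated directly once the multipliers, labels, support vectors and offset b are known. The sketch below does this in plain Python; the RBF kernel, the value of `gamma`, and the toy support vectors and multipliers are assumptions chosen for illustration and are not the kernel or parameters reported for PMirP.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2); gamma is an assumed value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate f(x) = sgn(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    score = sum(
        alpha * y * kernel(x, sv)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    ) + b
    # A score of exactly zero is mapped to +1 here; this tie rule is a convention.
    return 1 if score >= 0 else -1


# Toy model (made up): one support vector per class, satisfying alpha^T y = 0.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [-1, 1]
b = 0.0

near_positive = svm_decision((1.9, 2.0), support_vectors, alphas, labels, b)
near_negative = svm_decision((0.1, 0.0), support_vectors, alphas, labels, b)
```

Only the support vectors enter the sum, which is why prediction cost depends on their number rather than on the size of the full training set.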

In kernel designs, the

Comparison with triplet-SVM method

As shown in Table 1, 68 out of 69 human pre-miRNAs are correctly recognized by our method and 949 out of the 1000 pseudo pre-miRNAs are correctly detected as negative, giving accuracies of 98.4% and 94.9%, respectively. In particular, a major improvement is seen in the prediction of pseudo pre-miRNAs, where the triplet-SVM method of Ref. [10] classified only 881 samples correctly, a specificity of 88.1%. On the set of human pre-miRNA, the precision of our method is superior to that of the

Discussion and conclusions

Building on the triplet-SVM method, we developed a hybrid coding method by introducing the left-triplet, the free nucleotides, MFE and base-pairing features, with an SVM as the classifier. To test the effectiveness of the proposed method, human pre-miRNA sequence data, data sets from 11 other species and the latest pre-miRNA sequences were applied to both our method and the triplet-SVM method. Experimental results show that the proposed method improves the prediction accuracy compared with

Acknowledgements

We wish to thank the Institute of Bioinformatics, University of Georgia, USA, for technical assistance. We would also like to thank Dr. Chenghai Xue of Tsinghua University for helpful discussions. We are grateful for the support of the NSFC (10872077, 60903097, 60703025), the National High-technology Development Project of China (2009AA02Z307), and the Science-Technology Development Project from Jilin Province (20080708, 20090152).

References (22)

  • D.P. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell (2004)
  • Y. Wang et al. MicroRNA: past and present. Front Biosci (2007)
  • R.C. Lee et al. An extensive class of small RNAs in Caenorhabditis elegans. Science (2001)
  • R.J. Carter et al. A computational approach to identify genes for functional RNAs in genomic sequence. Nucleic Acids Res (2001)
  • I.L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Res (2003)
  • M. Zuker. Prediction of RNA secondary structure by energy minimization. Methods Mol Biol (1994)
  • L.P. Lim et al. The microRNAs of Caenorhabditis elegans. Genes Dev (2003)
  • L.P. Lim et al. Vertebrate microRNA genes. Science (2003)
  • E.C. Lai et al. Computational identification of Drosophila microRNA genes. Genome Biol (2003)
  • C.H. Xue et al. Classification of real and pseudo microRNA precursors using local structure–sequence features and support vector machine. BMC Bioinformatics (2005)
  • J. Peng et al. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res (2007)