Influence of Sequence Length in Promoter Prediction Performance

  • Conference paper
Advances in Bioinformatics and Computational Biology (BSB 2014)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8826)

Abstract

Rapid growth in genome sequencing capacity has made clear the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Because promoter identification is an important step in functional genomic annotation, several studies have proposed computational approaches to promoter prediction, using different classifiers and different characteristics of the promoter sequences. However, several works in the literature have addressed the problem using datasets containing sequences of 250 nucleotides or more. Since the sequence length determines the number of attributes in the dataset, training datasets with a high number of attributes are generated even when only a limited number of properties is used to characterize the sequences. Because high-dimensional datasets can degrade a classifier's predictive performance or even require infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential for obtaining good predictive performance at low computational cost. To the best of our knowledge, no work in the literature has systematically examined the relation between sequence length and the predictive performance of classifiers. Thus, in this work, sixteen datasets composed of sequences of different lengths are built and evaluated using the SVM and k-NN classifiers. The experimental results show that several datasets composed of shorter sequences achieved better predictive performance than datasets composed of longer sequences, while requiring significantly less processing time.
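
The link between sequence length and the number of dataset attributes can be illustrated with a minimal sketch. It assumes, purely for illustration, that each sequence is encoded with one value per dinucleotide step per property; the abstract does not specify the encoding. The property tables `prop_a` and `prop_b` hold placeholder numbers, and the helpers `encode`, `attribute_count` and `center_window` are hypothetical names, not the authors' code.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]

# Placeholder scores for two made-up properties (assumption: real studies
# would use published DNA structural scales instead of these numbers).
PROPERTIES = {
    "prop_a": {d: float(i) for i, d in enumerate(DINUCLEOTIDES)},
    "prop_b": {d: float(len(DINUCLEOTIDES) - i) for i, d in enumerate(DINUCLEOTIDES)},
}


def encode(sequence, properties=PROPERTIES):
    """Return one attribute per dinucleotide step, for each property in turn."""
    steps = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    return [table[step] for table in properties.values() for step in steps]


def attribute_count(length, n_properties):
    """Number of attributes encode() yields for a sequence of `length` nucleotides."""
    return (length - 1) * n_properties


def center_window(sequence, length):
    """Cut `length` nucleotides around the midpoint (assumed windowing scheme)."""
    start = (len(sequence) - length) // 2
    return sequence[start:start + length]


if __name__ == "__main__":
    full = "ACGT" * 63  # synthetic 252-nucleotide sequence, not real promoter data
    for length in (250, 100, 50, 12):
        window = center_window(full, length)
        print(length, len(encode(window)), attribute_count(length, len(PROPERTIES)))
```

Under this assumed encoding the attribute count grows linearly with sequence length, so shortening the sequences shrinks the training matrix that a classifier such as SVM or k-NN has to process.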

This research was partially supported by CNPq, FAPEMIG and UFOP.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Carvalho, S.G., Guerra-Sá, R., de C. Merschmann, L.H. (2014). Influence of Sequence Length in Promoter Prediction Performance. In: Campos, S. (eds) Advances in Bioinformatics and Computational Biology. BSB 2014. Lecture Notes in Computer Science, vol 8826. Springer, Cham. https://doi.org/10.1007/978-3-319-12418-6_6

  • DOI: https://doi.org/10.1007/978-3-319-12418-6_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12417-9

  • Online ISBN: 978-3-319-12418-6

  • eBook Packages: Computer Science, Computer Science (R0)
