Bioinformatics Analyses to Separate Species Specific mRNAs from Unknown Sequences in de novo Assembled Transcriptomes

  • Conference paper
Bioinformatics and Biomedical Engineering (IWBBIO 2015)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9044)

Abstract

The use of RNA-Seq has transformed the way sequencing reads are analyzed, allowing for qualitative and quantitative studies of transcriptomes. These studies always include a substantial collection (usually > 40%) of unknown transcripts. In this study, we improve the capability of Full-LengtherNext, an algorithm developed in our laboratory to annotate, analyze and correct de novo transcriptomes, to detect potentially coding sequences. We analyze five software implementations of coding sequence predictors and show that three factors have a profound effect on prediction accuracy: the use of high-quality sequences at the training stage, proper threshold selection during score interrogation, and the adaptation of the algorithm to its input type. TransDecoder, the best-performing algorithm in our tests, was therefore added to the Full-LengtherNext pipeline, significantly improving its coding prediction reliability. Moreover, these analyses served to make inferences about the quality of the sample and to extract the subset of species-specific (perhaps novel) genes discovered in the transcriptome assembly. Indirectly, we also demonstrated that Full-LengtherNext sequence classification is appropriate and worth taking into consideration.
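The role of threshold selection mentioned in the abstract can be illustrated with a small sketch. It is not the authors' procedure: it assumes each predictor emits one numeric score per transcript, that a labelled set of known coding and non-coding sequences is available, and that the cutoff is chosen by maximizing Youden's J (true positive rate minus false positive rate), which is only one of several possible ROC-based criteria. The function name `best_threshold` and the example scores are hypothetical.

```python
from typing import List, Tuple


def best_threshold(scores: List[float], is_coding: List[bool]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A sequence is predicted as coding when its score >= threshold.
    Ties are resolved in favour of the highest (strictest) threshold.
    """
    positives = sum(is_coding)
    negatives = len(is_coding) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one coding and one non-coding example")

    best_t, best_j = float("inf"), 0.0
    # Every distinct observed score is a candidate cutoff.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_coding) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, is_coding) if not y and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical scores for eight labelled sequences (True = known coding).
example_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
example_labels = [True, True, True, False, True, False, False, False]
threshold, youden_j = best_threshold(example_scores, example_labels)
```

In this toy data set one non-coding sequence scores higher than one coding sequence, so no cutoff separates the classes perfectly. The sketch therefore returns the strictest cutoff among those with the highest J.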

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Velasco, D., Seoane, P., Claros, M.G. (2015). Bioinformatics Analyses to Separate Species Specific mRNAs from Unknown Sequences in de novo Assembled Transcriptomes. In: Ortuño, F., Rojas, I. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2015. Lecture Notes in Computer Science, vol 9044. Springer, Cham. https://doi.org/10.1007/978-3-319-16480-9_32

  • DOI: https://doi.org/10.1007/978-3-319-16480-9_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16479-3

  • Online ISBN: 978-3-319-16480-9

  • eBook Packages: Computer Science (R0)
