Bioinformatics Analyses to Separate Species Specific mRNAs from Unknown Sequences in de novo Assembled Transcriptomes

Velasco, David; Seoane, Pedro; Claros, M. Gonzalo

doi:10.1007/978-3-319-16480-9_32

David Velasco²⁰,
Pedro Seoane²⁰ &
M. Gonzalo Claros²⁰

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 9044))

Included in the following conference series:

International Conference on Bioinformatics and Biomedical Engineering

3059 Accesses
1 Citations

Abstract

The use of RNA-Seq has transformed the way sequencing reads are analyzed, allowing for qualitative and quantitative studies of transcriptomes. These studies always include an important collection (usually > 40%) of unknown transcripts. In this study, we improve the capability of Full-LengtherNext, an algorithm developed in our laboratory to annotate, analyze and correct de novo transcriptomes, to detect of potentially coding sequences. Here we analyze five software implementations of coding sequence predictors and show that the use of high-quality sequences at the training stage, proper threshold selection during score interrogation and the algorithm adaptation to its input type have a profound effect on the accuracy of the prediction. TransDecoder, the best performing algorithm in our tests, was thus added to the Full-LenghterNext pipeline, significantly improving its coding prediction reliability. Moreover, these analyses served to make inferences about the quality of the sample and to extract the subset of species specific (perhaps novel) genes discovered in the transcriptome assembly. Indirectly, we also demonstrated that Full-LentherNext sequence classification is appropriate and worth taking into consideration.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Benzekri, H., Armesto, P., Cousin, X., Rovira, M., Crespo, D., Merlo, M.A., Mazurais, D., Bautista, R., Guerrero-Fernández, D., Fernandez-Pozo, N., Ponce, M., Infante, C., Zambonino, J.L., Nidelet, S., Gut, M., Rebordinos, L., Planas, J.V., Bégout, M.L., Claros, M.G., Manchado, M.: De novo assembly, characterization and functional annotation of Senegalese sole (Solea senegalensis) and common sole (Solea solea) transcriptomes: integration in a database and design of a microarray. BMC Genomics 15, 952 (2014)
Article Google Scholar
Besemer, J., Borodovsky, M.: Heuristic approach to deriving models for gene finding. Nucleic Acids Research 27(19), 3911–3920 (1999)
Article Google Scholar
Canales, J., Bautista, R., Label, P., Gómez-Maldonado, J., Lesur, I., Fernández-Pozo, N., Rueda-López, M., Guerrero-Fernández, D., Castro-Rodríguez, V., Benzekri, H., Cañas, R.A., Guevara, M.A., Rodrigues, A., Seoane, P., Teyssier, C., Morel, A., Ehrenmann, F., Le Provost, G., Lalanne, C., Noirot, C., Klopp, C., Reymond, I., García-Gutiérrez, A., Trontin, J.F., Lelu-Walter, M.A., Miguel, C., Cervera, M.T., Cantón, F.R., Plomion, C., Harvengt, L., Avila, C., Gonzalo Claros, M., Cánovas, F.M.: De novo assembly of maritime pine transcriptome: implications for forest breeding and biotechnology. Plant Biotechnology Journal 12(3), 286–299 (2014)
Article Google Scholar
Delcher, A.L., Bratke, K.A., Powers, E.C., Salzberg, S.L.: Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics (Oxford, England) 23(6), 673–679 (2007)
Article Google Scholar
Ellegren, H.: Genome sequencing and population genomics in non-model organisms. Trends in Ecology & Evolution 29(1), 51–63 (2014)
Article Google Scholar
Falgueras, J., Lara, A.J., Fernández-Pozo, N., Cantón, F.R., Pérez-Trabado, G., Claros, M.G.: SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics 11, 38 (2010)
Article Google Scholar
Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874 (2006)
Article MathSciNet Google Scholar
Fickett, J.W.: Recognition of protein coding regions in DNA sequences. Nucleic Acids Research 10(17), 5303–5318 (1982)
Article Google Scholar
Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics (Oxford, England) 28(23), 3150–3152 (2012)
Article Google Scholar
Gao, J., Qi, Y., Cao, Y., Tung, W.W.: Protein coding sequence identification by simultaneously characterizing the periodic and random features of DNA sequences. Journal of Biomedicine & Biotechnology 2005(2), 139–146 (2005)
Article Google Scholar
He, Z., Li, X., Ling, S., Fu, Y.X., Hungate, E., Shi, S., Wu, C.I.: Estimating DNA polymorphism from next generation sequencing data with high error rate by dual sequencing applications. BMC Genomics 14(1), 535 (2013)
Article Google Scholar
Jones, C.E., Brown, A.L., Baumann, U.: Estimating the annotation error rate of curated GO database sequence annotations. BMC Bioinformatics 8, 170 (2007)
Article Google Scholar
Lottaz, C., Iseli, C., Jongeneel, C.V., Bucher, P.: Modeling sequencing errors by combining Hidden Markov models. Bioinformatics 19(suppl. 2), ii103–ii112 (2003)
Google Scholar
Martin, D.M.A., Berriman, M., Barton, G.J.: GOtcha: A new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 5, 178 (2004)
Article Google Scholar
Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., Marçais, G., Pop, M., Yorke, J.A.: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 22(3), 557–567 (2012)
Article Google Scholar
Schnoes, A.M., Brown, S.D., Dodevski, I., Babbitt, P.C.: Annotation error in public databases: Misannotation of molecular function in enzyme superfamilies. PLoS Computational Biology 5(12), e1000605 (2009)
Google Scholar
Stanke, M., Schöffmann, O., Morgenstern, B., Waack, S.: Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006)
Article Google Scholar
Wang, L., Park, H.J., Dasari, S., Wang, S., Kocher, J.P., Li, W.: CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Research 41(6), e74 (2013)
Google Scholar
Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics 10(1), 57–63 (2009)
Article Google Scholar
Yin, C., Yau, S.S.T.: Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. Journal of Theoretical Biology 247(4), 687–694 (2007)
Article MathSciNet Google Scholar
Zagordi, O., Klein, R., Däumer, M., Beerenwinkel, N.: Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. Nucleic Acids Research 38(21), 7400–7409 (2010)
Article Google Scholar
Zhang, M.Q.: Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proceedings of the National Academy of Sciences of the United States of America 94, 565–568 (1997)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Departamento de Bioquímica y Biología Molecular, Universidad de Málaga, Málaga, MA, 29071, Spain
David Velasco, Pedro Seoane & M. Gonzalo Claros

Authors

David Velasco
View author publications
You can also search for this author in PubMed Google Scholar
Pedro Seoane
View author publications
You can also search for this author in PubMed Google Scholar
M. Gonzalo Claros
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Universidad de Granada. , Dpto. de Arquitectura y Tecnología de Computadores (ATC)., E.T.S. de Ingenierías en Informática y Telecomunicación. CITIC-UGR.,, , c/ Periodista Daniel Saucedo Aranda s/n, , 18071, Granada, , , Spain
Francisco Ortuño
Universidad de Granada, E.T.S. Ingenierías Informática y de Telecomunicación , , Dpto. Arquitectura y Tecnología de Computadores, CITIC-UGR, , , C Periodista Rafael Gómez Montero , , 18071, Granada,, Spain
Ignacio Rojas

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Velasco, D., Seoane, P., Claros, M.G. (2015). Bioinformatics Analyses to Separate Species Specific mRNAs from Unknown Sequences in de novo Assembled Transcriptomes. In: Ortuño, F., Rojas, I. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2015. Lecture Notes in Computer Science(), vol 9044. Springer, Cham. https://doi.org/10.1007/978-3-319-16480-9_32

Download citation

DOI: https://doi.org/10.1007/978-3-319-16480-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16479-3
Online ISBN: 978-3-319-16480-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics