Abstract
RNA-Seq has transformed the way sequencing reads are analyzed, allowing qualitative and quantitative studies of transcriptomes. These studies always include a substantial fraction (usually > 40%) of unknown transcripts. In this study, we improve the capability of Full-LengtherNext, an algorithm developed in our laboratory to annotate, analyze and correct de novo transcriptomes, to detect potentially coding sequences. We analyze five software implementations of coding sequence predictors and show that three factors have a profound effect on prediction accuracy: the use of high-quality sequences at the training stage, proper threshold selection during score interrogation, and the adaptation of the algorithm to its input type. TransDecoder, the best-performing algorithm in our tests, was therefore added to the Full-LengtherNext pipeline, significantly improving the reliability of its coding prediction. Moreover, these analyses served to make inferences about the quality of the sample and to extract the subset of species-specific (perhaps novel) genes discovered in the transcriptome assembly. Indirectly, we also demonstrated that the Full-LengtherNext sequence classification is appropriate and worth taking into consideration.
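The abstract does not say how thresholds were chosen, so the following is only a minimal sketch of one common approach to threshold selection for a coding-potential score: scanning candidate cut-offs and keeping the one that maximizes Youden's J (true positive rate minus false positive rate). The function name `best_threshold`, the toy scores and the labels are all hypothetical and are not taken from the chapter or from any of the evaluated predictors.

```python
from typing import List, Tuple


def best_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A sequence is called coding when its score is >= threshold.
    labels: 1 = known coding, 0 = known non-coding (assumed ground truth).
    On ties, the lowest threshold reaching the best J is kept.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one coding and one non-coding example")

    best_t, best_j = 0.0, -1.0
    # Every distinct observed score is a candidate cut-off.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical scores from a predictor on sequences of known status.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

threshold, j = best_threshold(scores, labels)
print(f"threshold={threshold:.2f}  J={j:.2f}")
```

With these toy values a cut-off of 0.40 recovers all four coding sequences at the cost of one false positive, which illustrates why a default threshold tuned on a different kind of input can noticeably change the apparent accuracy of a predictor.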
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Velasco, D., Seoane, P., Claros, M.G. (2015). Bioinformatics Analyses to Separate Species Specific mRNAs from Unknown Sequences in de novo Assembled Transcriptomes. In: Ortuño, F., Rojas, I. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2015. Lecture Notes in Computer Science, vol 9044. Springer, Cham. https://doi.org/10.1007/978-3-319-16480-9_32
DOI: https://doi.org/10.1007/978-3-319-16480-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16479-3
Online ISBN: 978-3-319-16480-9
eBook Packages: Computer Science (R0)