Abstract
Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome and to measure their relative abundances. A promising assay relies on RNA-Seq approaches, which build on next-generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collects the sequence reads, and interprets the raw reads in terms of transcripts, which are grouped with respect to the different splice-variant isoforms of a messenger RNA. We address a very basic problem involved in all of these pipelines, namely accurate Bayesian base-calling, which can combine the analog intensity data with suitable underlying priors on base composition in the transcripts. In the context of sequencing genomic DNA, a powerful approach for base-calling has been developed in the TotalReCaller pipeline. To derive its priors, it uses a suitable reference whole-genome sequence in a compressed self-indexed format. However, TotalReCaller faces many new challenges in the transcriptomic domain, especially since we still lack a fully annotated library of all possible transcripts, and hence a sufficiently good prior. There are many possible solutions, similar to the ones developed for TotalReCaller in applications addressing de novo sequencing and assembly, where partial contigs or string-graphs could be used to bootstrap the Bayesian priors on base composition. A similar approach would be applicable here too: partial assembly of transcripts can be used to characterize the splicing junctions or to organize them in incompatibility graphs, which are then provided as priors for TotalReCaller. The key algorithmic techniques for this purpose are addressed in a forthcoming paper on Stringomics.
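To make the idea of Bayesian base-calling concrete, the following is a minimal sketch of how analog intensities and a prior on base composition might be combined for a single sequencing cycle. It is an illustration only, not the TotalReCaller implementation: the function name `call_base`, the Gaussian noise model, the idealised channel response (1.0 for the true base, 0.0 otherwise), and the parameter `sigma` are all assumptions introduced here.

```python
import math

BASES = "ACGT"


def call_base(intensities, prior, sigma=0.25):
    """Return (best_base, posterior) for one sequencing cycle.

    intensities: dict mapping each base to its observed analog intensity.
    prior:       dict mapping each base to a prior probability, e.g. taken
                 from a reference sequence or a partial assembly.
    sigma:       assumed standard deviation of Gaussian channel noise
                 (hypothetical value, chosen for illustration).
    """
    log_post = {}
    for b in BASES:
        # Log-likelihood of the four intensities if the true base were b:
        # channel b is expected to read 1.0, every other channel 0.0.
        ll = 0.0
        for c in BASES:
            expected = 1.0 if c == b else 0.0
            ll -= (intensities[c] - expected) ** 2 / (2 * sigma ** 2)
        # Bayes' rule in log space; the floor avoids log(0) for zero priors.
        log_post[b] = ll + math.log(max(prior[b], 1e-12))

    # Normalise with the max-subtraction trick for numerical stability.
    m = max(log_post.values())
    weights = {b: math.exp(v - m) for b, v in log_post.items()}
    z = sum(weights.values())
    posterior = {b: w / z for b, w in weights.items()}
    best = max(BASES, key=posterior.get)
    return best, posterior


# An ambiguous signal (A and C equally bright) is resolved by the prior.
ambiguous = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
prior_c = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}
base, post = call_base(ambiguous, prior_c)
print(base, round(post["C"], 3))
```

In this sketch the prior only decides the call when the intensities are ambiguous; a clean signal still dominates, which is the behaviour one would want when the prior itself is incomplete.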
Here, we address a related but fundamental problem by assuming that we have only a reference genome, with certain intervals marked as candidate regions for ORFs (open reading frames), but not necessarily complete annotations regarding the 5’ or 3’ termini of a gene or its exon-intron structure. The algorithms we describe find the most accurate base-calls of a cDNA with the best possible segmentation, all mapped to the genome appropriately.
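As a rough illustration of what "segmentation mapped to the genome" means, the sketch below places an already base-called read on a genome as either one contiguous segment or two exon segments separated by a gap (an intron). It is a brute-force toy under stated assumptions, not the algorithm of the paper: the function `best_gapped_mapping`, the mismatch-count cost, the single-intron limit, and the parameters `min_intron` and `gap_penalty` are all hypothetical choices made here for clarity.

```python
def best_gapped_mapping(read, genome, min_intron=2, gap_penalty=1):
    """Exhaustively place `read` on `genome` as one or two exon segments.

    Returns (cost, segments), where segments is a list of half-open
    (start, end) genome intervals and cost is the number of mismatches
    plus `gap_penalty` if an intron is used. Returns None when the read
    is longer than the genome.
    """
    def mismatches(fragment, g_start):
        # Count positions where the fragment disagrees with the genome.
        return sum(1 for i, ch in enumerate(fragment)
                   if genome[g_start + i] != ch)

    n, m = len(genome), len(read)
    best = None

    # Case 1: ungapped placement (a single exon).
    for s in range(n - m + 1):
        cost = mismatches(read, s)
        if best is None or cost < best[0]:
            best = (cost, [(s, s + m)])

    # Case 2: one intron. The first k read bases form exon 1, the rest exon 2.
    for k in range(1, m):
        for s1 in range(n - m + 1):
            left = mismatches(read[:k], s1)
            # Exon 2 must start at least `min_intron` bases after exon 1 ends.
            for s2 in range(s1 + k + min_intron, n - (m - k) + 1):
                cost = left + mismatches(read[k:], s2) + gap_penalty
                if best is None or cost < best[0]:
                    best = (cost, [(s1, s1 + k), (s2, s2 + m - k)])
    return best


# Toy genome: exon "CCGG" at 2..6, intron "TTTTTT", exon "ACGT" at 12..16.
genome = "AACCGGTTTTTTACGTAA"
print(best_gapped_mapping("CCGGACGT", genome))
```

A practical method would replace the exhaustive search with an indexed, pruned search and would score candidate segmentations jointly with the base-call posteriors rather than after the fact.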
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Mishra, B. (2015). Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics. In: Natarajan, R., Barua, G., Patra, M.R. (eds) Distributed Computing and Internet Technology. ICDCIT 2015. Lecture Notes in Computer Science, vol 8956. Springer, Cham. https://doi.org/10.1007/978-3-319-14977-6_9
DOI: https://doi.org/10.1007/978-3-319-14977-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14976-9
Online ISBN: 978-3-319-14977-6
eBook Packages: Computer Science (R0)