Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics

Mishra, B.

doi:10.1007/978-3-319-14977-6_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8956)

Included in the following conference series:

International Conference on Distributed Computing and Internet Technology

Abstract

Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome and to measure their relative abundances. A promising assay relies on RNASeq approaches, which build on next-generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collects the sequence reads, and interprets the raw reads in terms of transcripts that are grouped with respect to different splice-variant isoforms of a messenger RNA. We address a very basic problem involved in all of these pipelines, namely accurate Bayesian base-calling, which could combine the analog intensity data with suitable underlying priors on base composition in the transcripts. In the context of sequencing genomic DNA, a powerful approach for base-calling has been developed in the TotalReCaller pipeline. For these purposes, it uses a suitable reference whole-genome sequence in a compressed self-indexed format to derive its priors. However, TotalReCaller faces many new challenges in the transcriptomic domain, especially since we still lack a fully annotated library of all possible transcripts, and hence a sufficiently good prior. There are many possible solutions, similar to the ones developed for TotalReCaller in applications addressing de novo sequencing and assembly, where partial contigs or string-graphs could be used to bootstrap the Bayesian priors on base composition. A similar approach would be applicable here too: partial assemblies of transcripts can be used to characterize the splicing junctions or to organize them in incompatibility graphs, which are then provided as priors for TotalReCaller. The key algorithmic techniques for this purpose are addressed in a forthcoming paper on Stringomics.
Here, we address a related but fundamental problem by assuming that we only have a reference genome, with certain intervals marked as candidate regions for ORFs (open reading frames), but not necessarily complete annotations regarding the 5’ or 3’ termini of a gene or its exon-intron structure. The algorithms we describe find the most accurate base-calls of a cDNA with the best possible segmentation, all mapped to the genome appropriately.
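
As a rough illustration of the Bayesian base-calling idea mentioned above (combining analog intensity data with a prior on base composition), the following minimal sketch computes a per-cycle posterior over the four bases. It is not the TotalReCaller algorithm: the four-channel Gaussian noise model, the `sigma` value, and the names `call_base` and `BASES` are assumptions made only for this example, and the prior is simply passed in rather than derived from an indexed reference.

```python
import math

BASES = "ACGT"  # assumed channel order for this illustration


def call_base(intensities, prior, sigma=0.25):
    """Return (best_base, posterior) for one sequencing cycle.

    intensities: four channel intensities, ordered as BASES.
    prior: dict mapping each base to its prior probability, e.g. taken
           from the reference context the read is believed to extend.
    sigma: assumed standard deviation of Gaussian channel noise.
    """
    log_post = {}
    for i, base in enumerate(BASES):
        # Idealized signal for `base`: 1.0 in its own channel, 0.0 elsewhere.
        sq_err = sum((x - (1.0 if j == i else 0.0)) ** 2
                     for j, x in enumerate(intensities))
        log_lik = -sq_err / (2.0 * sigma ** 2)
        # Floor the prior so that a zero entry does not raise on log().
        log_post[base] = log_lik + math.log(max(prior[base], 1e-12))

    # Normalize in log space (log-sum-exp) for numerical stability.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    posterior = {b: math.exp(v - top) / norm for b, v in log_post.items()}
    return max(posterior, key=posterior.get), posterior


if __name__ == "__main__":
    # Ambiguous signal between A and C; the prior decides the call.
    prior = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}
    base, post = call_base([0.5, 0.5, 0.0, 0.0], prior)
    print(base, round(post[base], 3))
```

When the intensities are ambiguous, the prior dominates the call; when one channel is clearly strongest, the likelihood overrides a contrary prior. This is the behavior that makes the quality of the prior matter in the transcriptomic setting the abstract describes.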

Author information

Authors and Affiliations

Courant Institute, NYU, New York, USA
B. Mishra

Editor information

Editors and Affiliations

School of Technology and Computer Science, Tata Institute of Fundamental Research, Homi Bhabha Road, Colaba, 400005, Mumbai, India
Raja Natarajan
Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, 781039, Guwahati, India
Gautam Barua
Department of Computer Science, Berhampur University, 760007, Berhampur, Odisha, India
Manas Ranjan Patra

About this paper

Cite this paper

Mishra, B. (2015). Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics. In: Natarajan, R., Barua, G., Patra, M.R. (eds) Distributed Computing and Internet Technology. ICDCIT 2015. Lecture Notes in Computer Science, vol 8956. Springer, Cham. https://doi.org/10.1007/978-3-319-14977-6_9

DOI: https://doi.org/10.1007/978-3-319-14977-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14976-9
Online ISBN: 978-3-319-14977-6
eBook Packages: Computer Science, Computer Science (R0)
