Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics

  • Conference paper
Distributed Computing and Internet Technology (ICDCIT 2015)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8956))

Abstract

Understanding complex mammalian biology depends crucially on our ability to define a precise map of all the transcripts encoded in a genome, and to measure their relative abundances. A promising assay depends on RNASeq approaches, which build on next-generation sequencing pipelines capable of interrogating cDNAs extracted from a cell. The underlying pipeline starts with base-calling, collects the sequence reads, and interprets the raw reads in terms of transcripts that are grouped with respect to different splice-variant isoforms of a messenger RNA. We address a very basic problem involved in all of these pipelines, namely accurate Bayesian base-calling, which could combine the analog intensity data with suitable underlying priors on base-composition in the transcripts. In the context of sequencing genomic DNA, a powerful approach for base-calling has been developed in the TotalReCaller pipeline. To derive its priors, it uses a suitable reference whole-genome sequence in a compressed self-indexed format. However, TotalReCaller faces many new challenges in the transcriptomic domain, especially since we still lack a fully annotated library of all possible transcripts, and hence a sufficiently good prior. There are many possible solutions, similar to the ones developed for TotalReCaller in applications addressing de novo sequencing and assembly, where partial contigs or string-graphs could be used to bootstrap the Bayesian priors on base-composition. A similar approach would be applicable here too: partial assemblies of transcripts can be used to characterize the splicing junctions, or to organize them in incompatibility graphs, and can then be provided as priors for TotalReCaller. The key algorithmic techniques for this purpose have been addressed in a forthcoming paper on Stringomics.
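The idea of combining analog intensity data with a prior on base-composition can be illustrated with a minimal sketch. It is not the TotalReCaller implementation: the function name `call_base`, the four-channel intensity model (the called base's channel near 1, the others near 0), the Gaussian noise with standard deviation `sigma`, and the example prior values are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def call_base(intensities, prior, sigma=0.25):
    """Return (best_base, posterior) for one sequencing cycle.

    intensities: dict base -> observed channel intensity (assumed in [0, 1])
    prior:       dict base -> prior probability, e.g. from a reference context
    sigma:       assumed Gaussian noise level (hypothetical value)
    """
    unnormalized = {}
    for b in BASES:
        # Assumed likelihood model: channel b reads ~1, every other channel ~0.
        log_lik = 0.0
        for c in BASES:
            expected = 1.0 if c == b else 0.0
            log_lik -= (intensities[c] - expected) ** 2 / (2 * sigma ** 2)
        # Bayes' rule: posterior is proportional to prior times likelihood.
        unnormalized[b] = prior[b] * math.exp(log_lik)
    total = sum(unnormalized.values())
    posterior = {b: unnormalized[b] / total for b in BASES}
    best = max(BASES, key=lambda b: posterior[b])
    return best, posterior

# An ambiguous signal: the A and C channels are equally bright.
signal = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
# Hypothetical prior suggesting that C is expected at this position.
reference_prior = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}
base, posterior = call_base(signal, reference_prior)
```

With the ambiguous signal above, the likelihoods of A and C are equal, so the prior alone decides the call. This is the situation in which a good reference-derived prior matters most.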
Here, we address a related but fundamental problem, by assuming that we only have a reference genome, with certain intervals marked as candidate regions for ORFs (Open Reading Frames), but not necessarily complete annotations regarding the 5’ or 3’ termini of a gene or its exon-intron structure. The algorithms we describe find the most accurate base-calls of a cDNA with the best possible segmentation, all mapped to the genome appropriately.
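To make the notion of a "segmentation mapped to the genome" concrete, the following sketch aligns an already base-called read to a genome string while allowing intron-like jumps. It is a plain dynamic program swapped in for illustration, not the algorithm of the paper: the function name `gapped_align`, the unit mismatch cost, and the flat `gap_penalty` per jump are assumptions, and base-calling is not performed jointly here.

```python
def gapped_align(read, genome, gap_penalty=2):
    """Align read to genome with mismatches and forward jumps (gaps).

    Returns (cost, segments); each segment is a tuple
    (read_start, genome_start, length), listed in read order.
    """
    if not read or not genome:
        raise ValueError("read and genome must be non-empty")
    n, m = len(read), len(genome)
    INF = float("inf")
    # cost[i][j]: best cost of aligning read[:i+1] with read[i] on genome[j].
    cost = [[INF] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]  # genome position of read[i-1]
    for j in range(m):
        cost[0][j] = int(read[0] != genome[j])
    for i in range(1, n):
        best_prev, best_prev_j = INF, None  # min of cost[i-1][j'] for j' <= j-2
        for j in range(1, m):
            if j >= 2 and cost[i - 1][j - 2] < best_prev:
                best_prev, best_prev_j = cost[i - 1][j - 2], j - 2
            mismatch = int(read[i] != genome[j])
            extend = cost[i - 1][j - 1]      # stay in the current segment
            jump = best_prev + gap_penalty   # open a new segment after a gap
            if extend <= jump:
                if extend < INF:
                    cost[i][j], back[i][j] = extend + mismatch, j - 1
            elif jump < INF:
                cost[i][j], back[i][j] = jump + mismatch, best_prev_j
    j = min(range(m), key=lambda k: cost[n - 1][k])
    total = cost[n - 1][j]
    if total == INF:
        return total, []
    # Trace back the genome position assigned to each read base.
    positions = [0] * n
    for i in range(n - 1, -1, -1):
        positions[i] = j
        j = back[i][j]
    # Group consecutive genome positions into segments.
    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or positions[i] != positions[i - 1] + 1:
            segments.append((start, positions[start], i - start))
            start = i
    return total, segments

# Toy genome with two exon-like blocks (ACGT and GGCA) around a C-rich gap.
cost, segments = gapped_align("ACGTGGCA", "TTACGTCCCCCCGGCATT")
```

In this toy example the read splits into two blocks of four bases, separated by one jump. The flat gap penalty is a deliberate simplification; a more realistic score would depend on splice-site signals or on the candidate ORF intervals mentioned above.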

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Mishra, B. (2015). Gappy Total Recaller: Efficient Algorithms and Data Structures for Accurate Transcriptomics. In: Natarajan, R., Barua, G., Patra, M.R. (eds) Distributed Computing and Internet Technology. ICDCIT 2015. Lecture Notes in Computer Science, vol 8956. Springer, Cham. https://doi.org/10.1007/978-3-319-14977-6_9

  • DOI: https://doi.org/10.1007/978-3-319-14977-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14976-9

  • Online ISBN: 978-3-319-14977-6

  • eBook Packages: Computer Science, Computer Science (R0)
