Inference of Isoforms from Short Sequence Reads

(Extended Abstract)

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2010)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6044)

Abstract

Alternative splicing in eukaryotic species makes the identification of mRNA isoforms (or splicing variants) a difficult problem. Traditional experimental methods for this purpose are time-consuming and costly. The emerging RNA-Seq technology offers a potentially effective way to address this problem. Although many studies have confirmed the advantages of RNA-Seq over traditional methods in transcriptome analysis, inferring isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) remains computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and to infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS), and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately when all the isoforms are known, and it can also infer novel isoforms from scratch. Our experimental tests on known mouse isoforms, with both expression levels and reads simulated, demonstrate that IsoInfer calculates the expression levels of isoforms with an accuracy comparable to that of the state-of-the-art statistical method while running 60 times faster. Moreover, our tests on both simulated and real reads show that it achieves good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS, and PAS information, especially for isoforms whose expression levels are sufficiently high.
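To give a rough sense of what estimating expression levels with a convex quadratic program can look like, the sketch below fits non-negative isoform expression levels to per-exon read counts by least squares. It is an illustration under simplifying assumptions, not the formulation used by IsoInfer: it assumes reads are spread uniformly over each exon, ignores read length, junction reads, and TSS/PAS information, and solves the problem with plain projected gradient descent. The function name `estimate_expression` and all input values are hypothetical.

```python
def estimate_expression(isoforms, exon_lengths, read_counts, steps=5000):
    """Minimize ||M x - d||^2 subject to x >= 0 by projected gradient descent.

    isoforms[i][j] is 1 if isoform i contains exon j, else 0 (assumed input).
    M[j][i] = exon_lengths[j] * isoforms[i][j], so (M x)[j] is the expected
    number of reads on exon j under the uniform-coverage assumption.
    """
    n_iso, n_exons = len(isoforms), len(exon_lengths)
    M = [[exon_lengths[j] * isoforms[i][j] for i in range(n_iso)]
         for j in range(n_exons)]

    # The gradient 2 M^T (M x - d) has Lipschitz constant at most
    # 2 * ||M||_F^2, so this fixed step size is safe for convergence.
    frob_sq = sum(v * v for row in M for v in row)
    step = 1.0 / (2.0 * frob_sq)

    x = [0.0] * n_iso
    for _ in range(steps):
        resid = [sum(M[j][i] * x[i] for i in range(n_iso)) - read_counts[j]
                 for j in range(n_exons)]
        grad = [2.0 * sum(M[j][i] * resid[j] for j in range(n_exons))
                for i in range(n_iso)]
        # Gradient step followed by projection onto the constraint x >= 0.
        x = [max(0.0, x[i] - step * grad[i]) for i in range(n_iso)]
    return x


# Toy gene with three exons (lengths in kb) and two hypothetical isoforms:
# isoform 0 uses all three exons, isoform 1 skips the middle exon.
isoforms = [[1, 1, 1], [1, 0, 1]]
exon_lengths = [1.0, 0.5, 2.0]
# Noise-free counts generated from expression levels 10 and 5 reads per kb.
read_counts = [15.0, 5.0, 30.0]
levels = estimate_expression(isoforms, exon_lengths, read_counts)
```

Because the toy counts are noise-free and the two isoforms differ in one exon, the least-squares solution is unique and the fit recovers the levels used to generate the counts. With real reads the residual would not vanish, and a dedicated quadratic programming solver would be a more practical choice than this hand-written loop.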




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Feng, J., Li, W., Jiang, T. (2010). Inference of Isoforms from Short Sequence Reads. In: Berger, B. (ed.) Research in Computational Molecular Biology. RECOMB 2010. Lecture Notes in Computer Science, vol 6044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12683-3_10

  • DOI: https://doi.org/10.1007/978-3-642-12683-3_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12682-6

  • Online ISBN: 978-3-642-12683-3

  • eBook Packages: Computer Science (R0)
