Abstract
Alternative splicing in eukaryotic species makes the identification of mRNA isoforms (splicing variants) a difficult problem. Traditional experimental methods for this purpose are time-consuming and costly. The emerging RNA-Seq technology offers a potentially effective way to address this problem. Although many studies have confirmed the advantages of RNA-Seq over traditional methods in transcriptome analysis, inferring isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) remains computationally challenging. In this work, we propose a method that calculates the expression levels of isoforms and infers isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS), and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm, called IsoInfer, to search for isoforms. IsoInfer can accurately calculate the expression levels of isoforms when all the isoforms are known, and it can also infer novel isoforms from scratch. Our tests on known mouse isoforms with simulated expression levels and reads demonstrate that IsoInfer calculates isoform expression levels with an accuracy comparable to that of the state-of-the-art statistical method while running 60 times faster. Moreover, our tests on both simulated and real reads show that it achieves good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS, and PAS information, especially for isoforms with high expression levels.
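To make the quadratic-programming idea concrete, the following is a minimal sketch and not IsoInfer's actual formulation. It assumes a simplified model in which the read density observed on each exon segment is approximately the sum of the expression levels of the isoforms containing that segment, and it fits non-negative expression levels by least squares, which is a convex quadratic program. The function name `estimate_expression`, the 0/1 `membership` matrix, the toy densities, and the choice of projected gradient descent as the solver are all assumptions made for illustration.

```python
def estimate_expression(membership, density, steps=5000):
    """Fit non-negative isoform expression levels by least squares.

    membership[s][i] is 1 if hypothetical isoform i contains exon
    segment s, else 0. density[s] is the read density observed on
    segment s. Minimizes 0.5 * ||A x - d||^2 subject to x >= 0 using
    projected gradient descent (an illustrative solver choice).
    """
    n_seg = len(membership)
    n_iso = len(membership[0])
    # The sum of squared entries equals trace(A^T A), which bounds the
    # largest eigenvalue of A^T A, so 1 / that sum is a safe step size.
    lipschitz = sum(a * a for row in membership for a in row)
    step = 1.0 / lipschitz
    x = [0.0] * n_iso
    for _ in range(steps):
        # Residual r = A x - d, one entry per exon segment.
        residual = [
            sum(row[i] * x[i] for i in range(n_iso)) - d
            for row, d in zip(membership, density)
        ]
        # Gradient of the objective is A^T r, one entry per isoform.
        grad = [
            sum(membership[s][i] * residual[s] for s in range(n_seg))
            for i in range(n_iso)
        ]
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, xi - step * gi) for xi, gi in zip(x, grad)]
    return x


# Toy gene with three exon segments and two isoforms:
# isoform 0 uses all three segments, isoform 1 skips the middle one.
membership = [[1, 1], [1, 0], [1, 1]]
density = [5.0, 2.0, 5.0]  # made-up densities for illustration
print(estimate_expression(membership, density))
```

In this toy case the middle segment is covered only by the first isoform, so its density pins down that isoform's level, and the remaining density on the shared segments is attributed to the second. A weighted objective or a dedicated quadratic programming solver could be substituted without changing the structure of the problem.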
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feng, J., Li, W., Jiang, T. (2010). Inference of Isoforms from Short Sequence Reads. In: Berger, B. (ed.) Research in Computational Molecular Biology. RECOMB 2010. Lecture Notes in Computer Science, vol 6044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12683-3_10
DOI: https://doi.org/10.1007/978-3-642-12683-3_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12682-6
Online ISBN: 978-3-642-12683-3
eBook Packages: Computer Science, Computer Science (R0)