Inference of Isoforms from Short Sequence Reads

Feng, Jianxing; Li, Wei; Jiang, Tao

doi:10.1007/978-3-642-12683-3_10

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6044)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time-consuming and costly. The emerging RNA-Seq technology offers a potentially effective way to address this problem. Although many studies have confirmed the advantages of RNA-Seq over traditional methods in transcriptome analysis, the inference of isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and to infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS) and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately when all the isoforms are known, and it can also infer novel isoforms from scratch. Our experimental tests on known mouse isoforms, with simulated expression levels and simulated reads, demonstrate that IsoInfer calculates the expression levels of isoforms with an accuracy comparable to that of the state-of-the-art statistical method while running 60 times faster. Moreover, our tests on both simulated and real reads show that it achieves good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS and PAS information, especially for isoforms with significantly high expression levels.
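To make the quadratic-programming idea in the abstract concrete, the following is a minimal sketch of estimating expression levels for known isoforms as a non-negative least-squares problem, which is one simple kind of convex quadratic program. It is not the IsoInfer formulation: the function name `estimate_expression`, the length-normalised residual, and the projected-gradient solver are all assumptions chosen for illustration, and the toy gene below is invented.

```python
def estimate_expression(isoforms, seg_lengths, read_counts, iters=5000):
    """Estimate per-base expression of known isoforms (illustrative only).

    isoforms[j][i] is 1 if hypothetical isoform j contains exon segment i.
    Minimises sum_i (sum_j isoforms[j][i] * x[j] - read_counts[i] / seg_lengths[i])^2
    subject to x[j] >= 0, using projected gradient descent.
    """
    n_iso, n_seg = len(isoforms), len(seg_lengths)
    # Observed read density (reads per base) on each segment.
    density = [read_counts[i] / seg_lengths[i] for i in range(n_seg)]
    # Safe step size: the squared Frobenius norm bounds the largest
    # eigenvalue of A^T A, so 1 / (2 * frob_sq) <= 1 / Lipschitz constant.
    frob_sq = sum(a * a for row in isoforms for a in row)
    step = 1.0 / (2.0 * frob_sq)
    x = [0.0] * n_iso
    for _ in range(iters):
        # Residual between predicted and observed density on each segment.
        resid = [
            sum(isoforms[j][i] * x[j] for j in range(n_iso)) - density[i]
            for i in range(n_seg)
        ]
        grad = [
            2.0 * sum(isoforms[j][i] * resid[i] for i in range(n_seg))
            for j in range(n_iso)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[j] - step * grad[j]) for j in range(n_iso)]
    return x


# Toy gene with three exon segments; the second isoform skips the middle one.
toy_isoforms = [[1, 1, 1], [1, 0, 1]]
toy_lengths = [100, 50, 200]
toy_counts = [300, 100, 600]  # consistent with expression levels 2.0 and 1.0
levels = estimate_expression(toy_isoforms, toy_lengths, toy_counts)
print([round(v, 3) for v in levels])
```

Because the objective is convex and the constraint set is a simple orthant, any standard quadratic-programming or non-negative least-squares solver could replace the hand-written loop; the loop is used here only to keep the sketch free of external dependencies.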

Author information

Authors and Affiliations

State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science, Tsinghua Univ., Beijing, China
Jianxing Feng
Department of Computer Science, Univ. of California, Riverside, CA
Wei Li & Tao Jiang
Tsinghua Univ., Beijing, China
Tao Jiang

Editor information

Editors and Affiliations

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, 02139, Cambridge, MA, USA
Bonnie Berger

Cite this paper

Feng, J., Li, W., Jiang, T. (2010). Inference of Isoforms from Short Sequence Reads. In: Berger, B. (ed.) Research in Computational Molecular Biology. RECOMB 2010. Lecture Notes in Computer Science, vol 6044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12683-3_10

DOI: https://doi.org/10.1007/978-3-642-12683-3_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12682-6
Online ISBN: 978-3-642-12683-3
eBook Packages: Computer Science (R0)
