AREM: Aligning Short Reads from ChIP-Sequencing by Expectation Maximization

Newkirk, Daniel; Biesinger, Jacob; Chon, Alvin; Yokomori, Kyoko; Xie, Xiaohui

doi:10.1007/978-3-642-20036-6_26

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6577)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

High-throughput sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) is widely used in characterizing genome-wide binding patterns of transcription factors, cofactors, chromatin modifiers, and other DNA binding proteins. A key step in ChIP-Seq data analysis is to map short reads from high-throughput sequencing to a reference genome and identify peak regions enriched with short reads. Although several methods have been proposed for ChIP-Seq analysis, most existing methods only consider reads that can be uniquely placed in the reference genome, and therefore have low power for detecting peaks located within repeat sequences. Here we introduce a probabilistic approach for ChIP-Seq data analysis which utilizes all reads, providing a truly genome-wide view of binding patterns. Reads are modeled using a mixture model corresponding to K enriched regions and a null genomic background. We use maximum likelihood to estimate the locations of the enriched regions, and implement an expectation-maximization (E-M) algorithm, called AREM (aligning reads by expectation maximization), to update the alignment probabilities of each read to different genomic locations. We apply the algorithm to identify genome-wide binding events of two proteins: Rad21, a component of cohesin and a key factor involved in chromatid cohesion, and Srebp-1, a transcription factor important for lipid/cholesterol homeostasis. Using AREM, we were able to identify 19,935 Rad21 peaks and 1,748 Srebp-1 peaks in the mouse genome with high confidence, including 1,517 (7.6%) Rad21 peaks and 227 (13%) Srebp-1 peaks that were missed using only uniquely mapped reads. The open source implementation of our algorithm is available at http://sourceforge.net/projects/arem
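The idea of updating alignment probabilities by expectation maximization can be illustrated with a deliberately simplified sketch. This is not the AREM implementation: the input format (each read given as a list of candidate region labels), the function name `em_align`, and the `background` pseudocount that stands in for the null genomic background are all assumptions made for illustration. Each iteration re-estimates a weight per region from the expected read counts, then redistributes every multiply-mapped read across its candidate regions in proportion to those weights.

```python
from collections import defaultdict


def em_align(reads, n_iter=50, background=0.1):
    """Toy EM-style reassignment of multiply-mapped reads.

    reads: list of lists; reads[i] holds the candidate region labels
           that read i aligns to (hypothetical input format).
    background: pseudocount added to every region, a crude stand-in
           for a null background (an assumption, not the paper's model).
    Returns a list of per-read probability lists, one entry per candidate.
    """
    # Start from uniform alignment probabilities for each read.
    probs = [[1.0 / len(cands)] * len(cands) for cands in reads]

    for _ in range(n_iter):
        # M-step: region weight = pseudocount + expected number of reads.
        weight = defaultdict(lambda: background)
        for cands, p in zip(reads, probs):
            for region, p_ij in zip(cands, p):
                weight[region] += p_ij

        # E-step: each read's alignment probabilities are proportional
        # to the current weights of its candidate regions.
        new_probs = []
        for cands in reads:
            w = [weight[region] for region in cands]
            total = sum(w)
            new_probs.append([x / total for x in w])
        probs = new_probs

    return probs


# Three reads map uniquely to region "A"; one read maps to both "A" and "B".
example_reads = [["A"], ["A"], ["A"], ["A", "B"]]
example_probs = em_align(example_reads)
print(example_probs[3])  # the ambiguous read is pulled strongly toward "A"
```

In this toy setting the ambiguous read ends up assigned mostly to the region already supported by uniquely mapped reads, which is the intuition behind using all reads rather than discarding those that map to repeats.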

Author information

Authors and Affiliations

Department of Biological Chemistry, University of California, Irvine, CA, 92697, USA
Daniel Newkirk & Kyoko Yokomori
Department of Computer Science, University of California, Irvine, CA, 92697, USA
Jacob Biesinger, Alvin Chon & Xiaohui Xie
The Institute for Genomics and Bioinformatics, University of California, Irvine, CA, 92697, USA
Daniel Newkirk, Jacob Biesinger, Alvin Chon & Xiaohui Xie

Editor information

Editors and Affiliations

EBU3b, University of California San Diego, #4218, 9500 Gilman Drive, 92093-0404, La Jolla, CA, USA
Vineet Bafna
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
S. Cenk Sahinalp

Cite this paper

Newkirk, D., Biesinger, J., Chon, A., Yokomori, K., Xie, X. (2011). AREM: Aligning Short Reads from ChIP-Sequencing by Expectation Maximization. In: Bafna, V., Sahinalp, S.C. (eds) Research in Computational Molecular Biology. RECOMB 2011. Lecture Notes in Computer Science, vol 6577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20036-6_26

DOI: https://doi.org/10.1007/978-3-642-20036-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20035-9
Online ISBN: 978-3-642-20036-6
eBook Packages: Computer Science (R0)
