DOI: 10.1145/3469877.3497696
research-article

Prediction of Transcription Factor Binding Sites Using Deep Learning Combined with DNA Sequences and Shape Feature Data

Published: 10 January 2022

Abstract

Knowing the locations of transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and cellular functions. Studies have shown that, in addition to the DNA sequence, the shape of DNA is also an important factor affecting binding activity. Here, we developed a convolutional neural network (CNN) model that integrates 3D DNA shape information, derived using a high-throughput method, to predict TFBSs. We identified the best-performing architectures by varying the CNN window size, number of kernels, number of hidden nodes, and number of hidden layers. The performance of the two types of data, and of their combination, was evaluated on 69 different ChIP-seq [1] experiments. Our results show that the model integrating shape and sequence information compared favorably with the sequence-only model. This work combines knowledge from structural biology and genomics, and shows that DNA shape features improve the description of TF binding specificity.
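The idea of feeding a CNN both sequence and shape can be illustrated with a minimal sketch. It is not the authors' implementation: the one-hot encoding, the two placeholder shape channels, the single hand-written kernel, and all function names (`one_hot`, `combine`, `conv1d_relu_maxpool`) are assumptions made for illustration, and the shape values are invented rather than produced by a shape-prediction tool. Each position becomes a row of four sequence channels plus the shape channels, and one convolution kernel is slid along the rows, passed through ReLU, and max-pooled.

```python
# Sketch only: the encoding, channel layout and kernel are illustrative assumptions,
# not the architecture reported in the paper.

ALPHABET = "ACGT"


def one_hot(seq):
    """One 4-element row per base; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def combine(seq, shape_rows):
    """Append per-position shape channels to the one-hot sequence channels."""
    encoded = one_hot(seq)
    if len(encoded) != len(shape_rows):
        raise ValueError("sequence and shape features must have the same length")
    return [s + list(f) for s, f in zip(encoded, shape_rows)]


def conv1d_relu_maxpool(matrix, kernel, bias=0.0):
    """Slide one kernel (window x channels) along the rows, apply ReLU, then take the max."""
    window = len(kernel)
    if len(matrix) < window:
        raise ValueError("input is shorter than the kernel window")
    activations = []
    for start in range(len(matrix) - window + 1):
        total = bias
        for offset in range(window):
            row = matrix[start + offset]
            total += sum(x * w for x, w in zip(row, kernel[offset]))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # global max-pooling


seq = "ACGTAC"
# Hypothetical, already-normalised shape values: two channels per position.
shape = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [0.2, 0.2], [0.1, 0.3], [0.4, 0.1]]
features = combine(seq, shape)  # 6 positions x (4 sequence + 2 shape) channels

# Window-2 kernel that responds to "C" followed by "G" and ignores the shape channels.
kernel = [[0, 1, 0, 0, 0, 0],
          [0, 0, 1, 0, 0, 0]]
score = conv1d_relu_maxpool(features, kernel)
print(score)
```

In a trained model the kernel weights, including those on the shape channels, would be learned from labelled ChIP-seq regions rather than written by hand, and many kernels would feed the hidden layers.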

References

[1] Karimzadeh M, Hoffman M M. Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome. bioRxiv, 2019, 168419.
[2] Lee T I, Young R A. Transcriptional regulation and its misregulation in disease. Cell, 2013, 152(6): 1237–1251.
[3] Vaquerizas J M, Kummerfeld S K, Teichmann S A, Luscombe N M. A census of human transcription factors: function, expression and evolution. Nature Reviews Genetics, 2009, 10(4): 252–263.
[4] Junion G, Spivakov M, Girardot C, Braun M, Gustafson E H, Birney E, Furlong E E. A transcription factor collective defines cardiac cell fate and reflects lineage history. Cell, 2012, 148(3): 473–486.
[5] Neph S, Vierstra J, Stergachis A B, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature, 2012, 489(7414): 83–90.
[6] Gilfillan G D, Hughes T, Sheng Y, Hjorthaug H S, Straub T, Gervin K, Harris J R, Undlien D E, Lyle R. Limitations and possibilities of low cell number ChIP-seq. BMC Genomics, 2012, 13: 645.
[7] Park P J. ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 2009, 10(10): 669–680.
[8] Slattery M, Zhou T, Yang L, et al. Absence of a simple code: how transcription factors read the genome. Trends in Biochemical Sciences, 2014, 39(9).
[9] Stormo G D, Zhao Y. Determining the specificity of protein–DNA interactions. Nature Reviews Genetics, 2010, 11(11): 751–760.
[10] Stormo G D. Modeling the specificity of protein-DNA interactions. Quantitative Biology, 2013, 1(2): 115–130.
[11] Rohs R, Jin X, West S M, et al. Origins of specificity in protein-DNA recognition. Annual Review of Biochemistry, 2010, 79(1): 233–269.
[12] Kim, Erik, et al. Probing Allostery Through DNA. In: NCBPC2, 2013.
[13] Watson L C, Kuchenbecker K M, Schiller B J, et al. The glucocorticoid receptor dimer interface allosterically transmits sequence-specific DNA signals. Nature Structural & Molecular Biology, 2013, 20(7): 876–883.
[14] Joshi R, Passner J M, Rohs R, et al. Functional specificity of a Hox protein mediated by the recognition of minor groove structure. Cell, 2007, 131(3): 530–543.
[15] Rohs R, West S M, Sosinsky A, et al. The role of DNA shape in protein-DNA recognition. Nature, 2009, 461(7268): 1248–1253.
[16] White M A, Myers C A, Corbo J C, et al. Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks. Proceedings of the National Academy of Sciences of the United States of America, 2013, 110(29): 11952–11957.
[17] Zhou T, Shen N, Yang L, Abe N, Horton J, Mann R S, Bussemaker H J, Gordân R, Rohs R. Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences of the United States of America, 2015, 112(15): 4654–4659.
[18] Zhou T, Yang L, Lu Y, Dror I, Dantas Machado A C, Ghane T, Di Felice R, Rohs R. DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Research, 2013, 41(Web Server issue): W56–W62.
[19] Gordân R, Shen N, Dror I, Zhou T, Horton J, Rohs R, Bulyk M L. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Reports, 2013, 3(4): 1093–1104.
[20] Lazarovici A, Zhou T, Shafer A, Dantas Machado A C, Riley T R, Sandstrom R, Sabo P J, Lu Y, Rohs R, Stamatoyannopoulos J A, Bussemaker H J. Probing DNA shape and methylation state on a genomic scale with DNase I. Proceedings of the National Academy of Sciences of the United States of America, 2013, 110(16): 6376–6381.
[21] Chen Y, Zhang X, Dantas Machado A C, Ding Y, Chen Z, Qin P Z, Rohs R, Chen L. Structure of p53 binding to the BAX response element reveals DNA unwinding and compression to accommodate base-pair insertion. Nucleic Acids Research, 2013, 41(17): 8368–8376.
[22] Chang Y P, Xu M, Machado A C, Yu X J, Rohs R, Chen X S. Mechanism of origin DNA recognition and assembly of an initiator-helicase complex by SV40 large tumor antigen. Cell Reports, 2013, 3(4): 1117–1127.
[23] Warner J B, Philippakis A A, Jaeger S A, He F S, Lin J, Bulyk M L. Systematic identification of mammalian regulatory motifs’ target genes and functions. Nature Methods, 2008, 5(4): 347–353.
[24] Ghandi M, Lee D, Mohammad-Noori M, Beer M A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Computational Biology, 2014, 10(7): e1003711.
[25] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436–444.
[26] Angermueller C, Lee H J, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biology, 2017, 18(1): 67.
[27] Qin Q, Feng J. Imputation for transcription factor binding predictions based on deep learning. PLoS Computational Biology, 2017, 13(2): e1005403.
[28] Yang B, Liu F, Ren C, Ouyang Z, Xie Z, Bo X, Shu W. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics, 2017, 33(13): 1930–1936.
[29] Kelley D R, Snoek J, Rinn J L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 2016, 26(7): 990–999.
[30] Zeng H, Edwards M D, Liu G, Gifford D K. Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics, 2016, 32(12): i121–i127.
[31] Jurtz V I, Johansen A R, Nielsen M, Almagro Armenteros J J, Nielsen H, Sønderby C K, Winther O, Sønderby S K. An introduction to deep learning on biological sequence data: examples and solutions. Bioinformatics, 2017, 33(22): 3685–3690.
[32] Liu Q, Xia F, Yin Q, Jiang R. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics, 2017, 34(5): 732–738.
[33] Min X, Zeng W, Chen N, Chen T, Jiang R. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics, 2017, 33(14): i92–i101.
[34] Bu H, Gan Y, Wang Y, Zhou S, Guan J. A new method for enhancer prediction based on deep belief network. BMC Bioinformatics, 2017, 18(Suppl 12): 418.
[35] Zhang J, Peng W, Wang L. LeNup: learning nucleosome positioning from DNA sequences with improved convolutional neural networks. Bioinformatics, 2018, 34(10): 1705–1712.
[36] Inukai S, Kock K H, Bulyk M L. Transcription factor-DNA binding: beyond binding site motifs. Current Opinion in Genetics & Development, 2017, 43: 110–119.
[37] Siggers T, Gordân R. Protein-DNA binding: complexities and multi-protein codes. Nucleic Acids Research, 2014, 42: 2099–2111.
[38] Pique-Regi R, Degner J F, Pai A A, Gaffney D J, Gilad Y, Pritchard J K. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Research, 2011, 21(3): 447–455.
[39] Gusmao E G, Allhoff M, Zenke M, Costa I G. Analysis of computational footprinting methods for DNase sequencing experiments. Nature Methods, 2016, 13(4): 303–309.
[40] Guo W L, Huang D S. An efficient method to transcription factor binding sites imputation via simultaneous completion of multiple matrices with positional consistency. Molecular BioSystems, 2017, 13(9): 1827–1837.
[41] Jing F, Zhang S W, Cao Z, Zhang S. Combining sequence and epigenomic data to predict transcription factor binding sites using deep learning. In: International Symposium on Bioinformatics Research and Applications, 2018, 241–252.
[42] Yang L, Zhou T, Dror I, Mathelier A, Wasserman W W, Gordân R, Rohs R. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Research, 2014, 42(Database issue): D148–D155.
[43] Qinhu, Zhang, Zhen, et al. Predicting in-vitro transcription factor binding sites using DNA sequence + shape. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019.
[44] Wang S, Shen Z, He Y, et al. A new method combining DNA shape features to improve the prediction accuracy of transcription factor binding sites. Springer, Cham, 2020.
[45] Min X, Zeng W, Chen S, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics, 2017, 18(Suppl 13): 478.
[46] Zhou J, Troyanskaya O G. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 2015, 12(10): 931–934.
[47] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016, 265–283.
[48] Bernstein B E, et al. An integrated encyclopedia of DNA elements in the human genome. Nature, 2012, 489(7414): 57–74.
[49] Srivastava N, Hinton G E, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014, 15(56): 1929–1958.
[50] Zeiler M D. ADADELTA: an adaptive learning rate method. arXiv, 2012, arXiv:1212.5701.


          Published In

          MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
          December 2021
          508 pages
          ISBN:9781450386074
          DOI:10.1145/3469877
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

          1. DNA-protein binding
          2. binding site
          3. convolution neural network
          4. deep learning
          5. transcription factor

          Qualifiers

          • Research-article
          • Research
          • Refereed limited


          Conference

          MMAsia '21
          Sponsor:
          MMAsia '21: ACM Multimedia Asia
          December 1 - 3, 2021
          Gold Coast, Australia

          Acceptance Rates

          Overall Acceptance Rate 59 of 204 submissions, 29%


