Predicting Protein-DNA Binding Sites by Fine-Tuning BERT

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13394)

Abstract

The study of protein-DNA binding sites is one of the fundamental problems in genome biology research. It plays an important role in understanding gene expression and transcription, and in biological research and drug development. In recent years, language representation models have achieved remarkable results in Natural Language Processing (NLP) and have received extensive attention from researchers. Bidirectional Encoder Representations from Transformers (BERT) has achieved state-of-the-art results in other domains, using word embeddings to capture the semantics of sentences. On small datasets, previous models often fail to capture the global upstream and downstream context of DNA sequences, so it is reasonable to apply the BERT model to DNA sequences. Models pre-trained on large datasets and then fine-tuned on task-specific datasets achieve excellent results on a range of downstream tasks. In this study, we first treat DNA sequences as sentences and tokenize them using the k-mer method. We then use BERT to encode the fixed-length tokenized sentences as matrices, extract features, and perform classification. We compare this method with current state-of-the-art models; the DNABERT method performs better, with average improvements of 0.013537, 0.010866, 0.029813, 0.052611, and 0.122131 in ACC, F1-score, MCC, Precision, and Recall, respectively. Overall, one advantage of BERT is that the pre-training strategy speeds up network convergence in transfer learning and improves the learning ability of the network. The DNABERT model generalizes well to other DNA datasets and can be applied to other sequence classification tasks.
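The k-mer tokenization step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the value of k or the stride, so the overlapping stride-1 windows and the default k=6 below are assumptions, and the function names `kmer_tokenize` and `to_sentence` are hypothetical helpers, not the authors' code.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    k=6 is an assumed default; the abstract does not specify k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def to_sentence(sequence: str, k: int = 6) -> str:
    """Join the k-mers with spaces so the sequence reads like a sentence of words."""
    return " ".join(kmer_tokenize(sequence, k))


tokens = kmer_tokenize("ATGCGTAC", k=3)
sentence = to_sentence("ATGCGTAC", k=3)
print(tokens)
print(sentence)
```

A sequence of length n produces n - k + 1 tokens under these assumptions, and the space-separated string can then be passed to a BERT-style tokenizer whose vocabulary consists of k-mers.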

Acknowledgments

This work was supported in part by the University Innovation Team Project of Jinan (2019GXRC015) and the Natural Science Foundation of Shandong Province, China (Grant No. ZR2021MF036).

Author information

Corresponding author

Correspondence to Yi Cao.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, Y., Chen, Y., Chen, B., Cao, Y., Chen, J., Cong, H. (2022). Predicting Protein-DNA Binding Sites by Fine-Tuning BERT. In: Huang, DS., Jo, KH., Jing, J., Premaratne, P., Bevilacqua, V., Hussain, A. (eds) Intelligent Computing Theories and Application. ICIC 2022. Lecture Notes in Computer Science, vol 13394. Springer, Cham. https://doi.org/10.1007/978-3-031-13829-4_57

  • DOI: https://doi.org/10.1007/978-3-031-13829-4_57

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13828-7

  • Online ISBN: 978-3-031-13829-4

  • eBook Packages: Computer Science, Computer Science (R0)
