Improved DNA-versus-Protein Homology Search for Protein Fossils

  • Conference paper
Algorithms for Computational Biology (AlCoB 2021)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12715)

Abstract

Protein fossils, i.e. noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64 × 21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and >10× faster. Of the ~7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally.
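
To make the idea of a 64 × 21 substitution matrix concrete, the following is a minimal illustrative sketch and not the fitting procedure used in the paper. It assumes we already have counts of aligned (codon, residue) pairs, and turns them into log-odds scores for 64 codons against 21 symbols (20 amino acids plus a stop symbol). The function name `log_odds_matrix`, the pseudocount, and the toy counts are all hypothetical and chosen only for illustration.

```python
import itertools
import math

# 64 codons (rows) and 21 symbols (columns): 20 amino acids plus "*" for stop.
CODONS = ["".join(c) for c in itertools.product("ACGT", repeat=3)]
RESIDUES = list("ACDEFGHIKLMNPQRSTVWY") + ["*"]


def log_odds_matrix(pair_counts, pseudocount=1.0):
    """Build a 64 x 21 table of log2 odds scores from (codon, residue) counts.

    Hypothetical helper: score = log2( P(codon, residue) / (P(codon) * P(residue)) ),
    with a pseudocount added to every cell so that no probability is zero.
    """
    counts = {
        c: {a: pair_counts.get((c, a), 0) + pseudocount for a in RESIDUES}
        for c in CODONS
    }
    total = sum(sum(row.values()) for row in counts.values())
    codon_freq = {c: sum(counts[c].values()) / total for c in CODONS}
    residue_freq = {a: sum(counts[c][a] for c in CODONS) / total for a in RESIDUES}
    return {
        c: {
            a: math.log2((counts[c][a] / total) / (codon_freq[c] * residue_freq[a]))
            for a in RESIDUES
        }
        for c in CODONS
    }


# Toy, made-up counts (not real data): ATG mostly aligned to M, and so on.
toy_counts = {("ATG", "M"): 50, ("ATG", "L"): 5, ("TGG", "W"): 40, ("CTG", "L"): 60}
scores = log_odds_matrix(toy_counts)

# The best-scoring residue for a codon is the one it was most enriched with.
best_for_atg = max(scores["ATG"], key=scores["ATG"].get)
print(best_for_atg, round(scores["ATG"]["M"], 2))
```

With counts gathered from real homologous alignments, one would expect the top-scoring residue for each codon to tend to match its translation, which is one way to read the abstract's remark that a fitted matrix learns the genetic code.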

Acknowledgments

We thank the Frith and Asai lab members for discussions that clarified our thinking.

Author information

Corresponding author

Correspondence to Martin C. Frith.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yao, Y., Frith, M.C. (2021). Improved DNA-versus-Protein Homology Search for Protein Fossils. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, vol 12715. Springer, Cham. https://doi.org/10.1007/978-3-030-74432-8_11

  • DOI: https://doi.org/10.1007/978-3-030-74432-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-74431-1

  • Online ISBN: 978-3-030-74432-8

  • eBook Packages: Computer Science, Computer Science (R0)
