Improved DNA-versus-Protein Homology Search for Protein Fossils

Yao, Yin; Frith, Martin C.

doi:10.1007/978-3-030-74432-8_11

Yin Yao^11,12 &
Martin C. Frith^11,12,13

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 12715))

Included in the following conference series:

International Conference on Algorithms for Computational Biology

561 Accesses
3 Citations
2 Altmetric

Abstract

Protein fossils, i.e. noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64 $\times $ 21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and ${>}10{\times }$ faster. Of the $\sim $7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

PQ, a new program for phylogeny reconstruction

Article Open access 12 October 2018

A Protocol for Phylogenetic Reconstruction

Phylogeny-Aware Alignment with PRANK and PAGAN

References

Allison, L., Wallace, C.S., Yee, C.N.: Finite-state models in the alignment of macromolecules. J. Mol. Evol. 35(1), 77–89 (1992)
Article Google Scholar
Altschul, S.F., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids Res. 25(17), 3389–3402 (1997)
Article Google Scholar
Campbell, S., Aswad, A., Katzourakis, A.: Disentangling the origins of virophages and polintons. Curr. Opin. Virol. 25, 59–65 (2017)
Article Google Scholar
Csűrös, M., Miklós, I.: Statistical alignment of retropseudogenes and their functional paralogs. Mol. Biol. Evol. 22(12), 2457–2471 (2005)
Article Google Scholar
Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998)
Book Google Scholar
Eddy, S.R.: A new generation of homology search tools based on probabilistic inference. Genome Inform. 23(1), 205–211 (2009)
MathSciNet Google Scholar
Eddy, S.R.: A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput. Biol. 4(5), e1000069 (2008)
Article MathSciNet Google Scholar
Frith, M.C.: Gentle masking of low-complexity sequences improves homology search. PLoS One 6(12), e28819 (2011)
Article Google Scholar
Frith, M.C.: A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 39(4), e23–e23 (2011)
Article Google Scholar
Frith, M.C.: How sequence alignment scores correspond to probability models. Bioinformatics 36(2), 408–415 (2020)
MathSciNet Google Scholar
Gotoh, O.: Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics 16(3), 190–202 (2000)
Article Google Scholar
Guan, X., Uberbacher, E.C.: Alignments of DNA and protein sequences containing frameshift errors. Comput. Appl. Biosci. 12(1), 31–40 (1996)
Google Scholar
Halperin, E., Faigler, S., Gill-More, R.: FramePlus: aligning DNA to protein sequences. Bioinformatics 15(11), 867–873 (1999)
Article Google Scholar
Harris, R.S.: Improved pairwise alignment of genomic DNA. Ph.D. thesis, The Pennsylvania State University (2007)
Google Scholar
Huang, X., Zhang, J.: Methods for comparing a DNA sequence with a protein sequence. Bioinformatics 12(6), 497–506 (1996)
Article Google Scholar
Huson, D.H., et al.: MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol. Direct 13(1), 6 (2018)
Article Google Scholar
Katzourakis, A., Gifford, R.J.: Endogenous viral elements in animal genomes. PLoS Genet. 6(11), e1001191 (2010)
Article Google Scholar
Kent, W.J., et al.: The human genome browser at UCSC. Genome Res. 12(6), 996–1006 (2002)
Article Google Scholar
Kiełbasa, S.M., Wan, R., Sato, K., Horton, P., Frith, M.C.: Adaptive seeds tame genomic sequence comparison. Genome Res. 21(3), 487–493 (2011)
Article Google Scholar
Ko, P., Narayanan, M., Kalyanaraman, A., Aluru, S.: Space-conserving optimal DNA-protein alignment. In: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004, pp. 80–88. IEEE (2004)
Google Scholar
Lam, H.Y., et al.: Pseudofam: the pseudogene families database. Nucleic Acids Res. 37(suppl$\_$1), D738–D743 (2009)
Google Scholar
Lysholm, F.: Highly improved homopolymer aware nucleotide-protein alignments with 454 data. BMC Bioinform. 13(1), 230 (2012)
Article Google Scholar
Pearson, W.R., Wood, T., Zhang, Z., Miller, W.: Comparison of DNA sequences with protein sequences. Genomics 46(1), 24–36 (1997)
Article Google Scholar
Peltola, H., Söderlund, H., Ukkonen, E.: Algorithms for the search of amino acid patterns in nucleic acid sequences. Nucleic Acids Res. 14(1), 99–107 (1986)
Article Google Scholar
Poulter, R.T., Butler, M.I.: Tyrosine recombinase retrotransposons and transposons. In: Mobile DNA III, pp. 1271–1291 (2015)
Google Scholar
Pritham, E.J., Feschotte, C.: Massive amplification of rolling-circle transposons in the lineage of the bat Myotis lucifugus. Proc. Nat. Acad. Sci. 104(6), 1895–1900 (2007)
Article Google Scholar
Raes, J., Van de Peer, Y.: Functional divergence of proteins through frameshift mutations. Trends Genet. 21(8), 428–431 (2005)
Article Google Scholar
Roytberg, M., et al.: On subset seeds for protein alignment. IEEE/ACM Trans. Comput. Biol. Bioinform. 6(3), 483–494 (2009)
Article Google Scholar
Sheetlin, S.L., Park, Y., Frith, M.C., Spouge, J.L.: Frameshift alignment: statistics and post-genomic applications. Bioinformatics 30(24), 3575–3582 (2014)
Article Google Scholar
Smit, A., Hubley, R., Green, P.: RepeatMasker open-4.0 (2013–2015). http://www.repeatmasker.org
Starrett, G.J., et al.: Adintoviruses: a proposed animal-tropic family of midsize eukaryotic linear dsDNA (MELD) viruses. Virus Evol. (2020). veaa055
Google Scholar
States, D., Botstein, D.: Molecular sequence accuracy and the analysis of protein coding regions. Proc. Nat. Acad. Sci. U.S.A. 88(13), 5518 (1991)
Article Google Scholar
Steinegger, M., Söding, J.: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35(11), 1026–1028 (2017)
Article Google Scholar
Storer, J., Hubley, R., Rosen, J., Wheeler, T.J., Smit, A.F.: The Dfam community resource of transposable element families, sequence models, and genome annotations. Mobile DNA 12(1), 1–14 (2021)
Article Google Scholar
Tanay, A., Siggia, E.D.: Sequence context affects the rate of short insertions and deletions in flies and primates. Genome Biol. 9(2), R37 (2008)
Article Google Scholar
Tzou, P.L., Huang, X., Shafer, R.W.: NucAmino: a nucleotide to amino acid alignment optimized for virus gene sequences. BMC Bioinform. 18(1), 138 (2017)
Article Google Scholar
Wang, R., Xiong, J., Wang, W., Miao, W., Liang, A.: High frequency of +1 programmed ribosomal frameshifting in Euplotes octocarinatus. Sci. Rep. 6, 21139 (2016)
Article Google Scholar
Wells, J.N., Feschotte, C.: A field guide to eukaryotic transposable elements. Ann. Rev. Genet. 54, 539–561 (2020)
Article Google Scholar
Yu, Y.K., Hwa, T.: Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biol. 8(3), 249–282 (2001)
Article Google Scholar
Yu, Y.K., Bundschuh, R., Hwa, T.: Hybrid alignment: high-performance with universal statistics. Bioinformatics 18(6), 864–872 (2002)
Article Google Scholar
Zhang, Z., Pearson, W.R., Miller, W.: Aligning a DNA sequence with a protein sequence. J. Comput. Biol. 4(3), 339–349 (1997)
Article Google Scholar

Download references

Acknowledgments

We thank the Frith and Asai lab members for discussions that clarified our thinking.

Author information

Authors and Affiliations

Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan
Yin Yao & Martin C. Frith
Artificial Intelligence Research Center, AIST, Tokyo, Japan
Yin Yao & Martin C. Frith
Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST, Tokyo, Japan
Martin C. Frith

Authors

Yin Yao
View author publications
You can also search for this author in PubMed Google Scholar
Martin C. Frith
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Martin C. Frith .

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
University of Extremadura, Cáceres, Spain
Miguel A. Vega-Rodríguez
University of Montana, Missoula, MT, USA
Travis Wheeler

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Yao, Y., Frith, M.C. (2021). Improved DNA-versus-Protein Homology Search for Protein Fossils. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science(), vol 12715. Springer, Cham. https://doi.org/10.1007/978-3-030-74432-8_11

Download citation

DOI: https://doi.org/10.1007/978-3-030-74432-8_11
Published: 31 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74431-1
Online ISBN: 978-3-030-74432-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics