Analysis and Classification of Constrained DNA Elements with N-gram Graphs and Genomic Signatures

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8542)

Abstract

Most common methods for investigating genomic sequence composition are based on the bag-of-words approach and thus largely ignore the original sequence structure or the relative positioning of its constituent oligonucleotides. Here we present a novel methodology that takes into account both word representation and relative positioning at various length scales, in the form of n-gram graphs (NGG). We applied the NGG approach to short vertebrate and invertebrate constrained genomic sequences of various origins and predicted functionalities, and were able to efficiently distinguish DNA sequences belonging to the same species (intra-species classification). As an alternative method, we also applied the Genomic Signatures (GS) approach to the same sequences. To our knowledge, this is the first time that GS have been applied to short sequences rather than whole genomes. Together, the presented results suggest that NGG is an efficient method for classifying sequences originating from a given genome according to their function.
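
As an illustration only, the two representations named in the abstract can be sketched as follows. The abstract does not give the n-gram length, the neighbourhood window, the graph similarity measure, or the exact GS formulation, so everything here is an assumption: the function names are hypothetical, `n` and `window` are arbitrary defaults, the similarity shown is a simple unweighted edge containment, and the signature is the single-strand dinucleotide odds ratio f_xy / (f_x f_y) in the style of Karlin and Burge.

```python
from collections import Counter
from itertools import product


def ngram_graph(seq, n=3, window=3):
    """Count directed edges (n-gram, n-gram starting up to `window` positions later)."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return edges


def containment_similarity(g1, g2):
    """Fraction of g1's edges that also occur in g2 (edge weights ignored)."""
    if not g1:
        return 0.0
    return sum(1 for edge in g1 if edge in g2) / len(g1)


def dinucleotide_signature(seq):
    """Odds ratio f_xy / (f_x * f_y) for each of the 16 dinucleotides (single strand)."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    if len(seq) < 2:
        return {pair: 0.0 for pair in pairs}
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total_mono, total_di = len(seq), len(seq) - 1
    signature = {}
    for pair in pairs:
        expected = (mono[pair[0]] / total_mono) * (mono[pair[1]] / total_mono)
        observed = di[pair] / total_di
        signature[pair] = observed / expected if expected > 0 else 0.0
    return signature
```

Unlike a bag-of-words count, the graph keeps which n-grams occur near each other, which is the positional information the abstract says bag-of-words methods discard. In a classification setting one might compare a sequence's graph against a representative graph per class, but that step is not described in the abstract and is left out here.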


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Polychronopoulos, D., Krithara, A., Nikolaou, C., Paliouras, G., Almirantis, Y., Giannakopoulos, G. (2014). Analysis and Classification of Constrained DNA Elements with N-gram Graphs and Genomic Signatures. In: Dediu, AH., Martín-Vide, C., Truthe, B. (eds) Algorithms for Computational Biology. AlCoB 2014. Lecture Notes in Computer Science, vol 8542. Springer, Cham. https://doi.org/10.1007/978-3-319-07953-0_18

  • DOI: https://doi.org/10.1007/978-3-319-07953-0_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07952-3

  • Online ISBN: 978-3-319-07953-0

  • eBook Packages: Computer Science, Computer Science (R0)
