Searching for the Origins of Life – Detecting RNA Life Signatures Using Learning Vector Quantization

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 976)

Abstract

The most plausible hypothesis for explaining the origins of life on Earth is the RNA world hypothesis, which is supported by a growing number of research results from various scientific areas. It is frequently supposed that a hypothetical species existed on Earth whose base RNA sequence was probably dissimilar from any genome known today. It is hard to distinguish hypothetical sequences obtained by computer simulations from biological sequences and, hence, to decide which characteristics provide biological functionality. In the present study, biological sequences obtained from RNA viruses are compared with computationally generated sequences (artificial life probes). The task is to discriminate the samples by their origin, biological or artificial. We use learning vector quantization (LVQ) as the classifier. LVQ is a dissimilarity-based classifier that places only weak requirements on the underlying dissimilarity measure. This makes it possible to investigate how well several dissimilarity measures discriminate for this task. In particular, we consider information-theoretic dissimilarities such as the normalized compression distance (NCD) and divergences based on bag-of-words (BoW) vectors generated from nucleotide codons. Additionally, the geodesic path distance (GPD) is applied, using a unary coding of sequences for a representation in the underlying Grassmann manifold. Both BoW and GPD allow continuous updates of prototypes, in the feature space and in the Grassmann manifold, respectively, whereas NCD restricts the application of LVQ methods to median variants.
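
As a rough illustration of two of the dissimilarities named in the abstract, the following Python sketch computes an NCD and a divergence between codon-based BoW vectors. It is not the authors' implementation: the choice of zlib as the compressor, the non-overlapping reading frame starting at position 0, the Kullback-Leibler divergence with a smoothing constant `eps`, and all function names are assumptions made for this example.

```python
import math
import zlib
from collections import Counter
from itertools import product


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# All 64 RNA codons in a fixed order, so BoW vectors are comparable.
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]


def codon_bow(seq: str) -> list:
    """Relative codon frequencies, read in non-overlapping triplets
    from position 0 (the reading frame is an assumption)."""
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        raise ValueError("sequence contains no valid codon")
    return [counts[c] / total for c in CODONS]


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """Kullback-Leibler divergence with additive smoothing (hypothetical
    choice; the paper may use a different divergence)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

A call such as `ncd(seq_a, seq_b)` only needs the two raw strings, which is why NCD yields pairwise dissimilarities rather than vectors that could be averaged or shifted. The BoW vectors, by contrast, live in a 64-dimensional feature space where prototypes can be moved continuously.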

Notes

  1. For computational convenience, it is usually assumed that both matrices \(\mathbf{X}\) and \(\mathbf{Y}\) are orthonormal, which can always be achieved by Gram-Schmidt orthonormalization. We adopt this assumption here as well. If the assumption is dropped, the procedure remains valid but becomes more complicated; see [34].
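
The orthonormality assumption in this note can be illustrated with a short NumPy sketch. It orthonormalizes the columns of two basis matrices by modified Gram-Schmidt and then computes the arc-length distance between the spanned subspaces from their principal angles, which is one common definition of a geodesic distance on the Grassmann manifold. Whether this coincides exactly with the GPD used in the paper is an assumption, as are the function names and the requirement that the input columns be linearly independent.

```python
import numpy as np


def gram_schmidt(A: np.ndarray) -> np.ndarray:
    """Orthonormalize the columns of A with modified Gram-Schmidt.
    Assumes the columns are linearly independent."""
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)
        for i in range(j):
            # Remove the component along the already computed direction.
            v = v - (Q[:, i] @ v) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q


def geodesic_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """Arc-length distance between span(X) and span(Y), where X and Y
    have orthonormal columns: the 2-norm of the principal angles."""
    # Singular values of X^T Y are the cosines of the principal angles.
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return float(np.linalg.norm(theta))
```

With orthonormal \(\mathbf{X}\) and \(\mathbf{Y}\), the product \(\mathbf{X}^{\top}\mathbf{Y}\) directly contains the cosines of the principal angles in its singular values, which is the convenience the note alludes to.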

References

  1. Gilbert W (1986) Origin of life: the RNA world. Nature 319(6055):618

  2. Neveu M, Kim H-J, Benner SA (2013) The “Strong” RNA world hypothesis: fifty years old. Astrobiology 13(4):391–403

  3. Rich A (1962) On the problems of evolution and biochemical information transfer. In: Kasha M, Pullman B (eds) Horizons in biochemistry. Academic Press, pp 103–126

  4. Cech TR (2011) The RNA worlds in context. Cold Spring Harb Perspect Biol 4(7):a006742

  5. Wasik S, Szostak N, Kudla M, Wachowiak M, Krawiec K, Blazewicz J (2019) Detecting life signatures with RNA sequence similarity measure. J Theor Biol 463:110–120

  6. Szostak N, Synak J, Borowski M, Wasik S, Blazewicz J (2017) Simulating the origins of life: the dual role of RNA replicases as an obstacle to evolution. PLOS ONE 12(7):1–28

  7. Eigen M (1971) Selforganization of matter and the evolution of biological macromolecules. Die Naturwiss 58(10):465–523

  8. Quastler H (1953) Essays on the use of information theory in biology. University of Illinois Press, Urbana

  9. Szostak N, Wasik S, Blazewicz J (2017) Understanding life: a bioinformatics perspective. Eur Rev 25(2):231–245

  10. Kohonen T (1988) Learning vector quantization. Neural Netw 1(Suppl. 1):303

  11. Sato A, Yamada K (1996) Generalized learning vector quantization. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Advances in neural information processing systems, vol 8. Proceedings of the 1995 Conference. MIT Press, Cambridge, pp 423–429

  12. Nebel D, Hammer B, Frohberg K, Villmann T (2015) Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing 169:295–305

  13. Wasik S, Prejzendanc T, Blazewicz J (2013) ModeLang - a new approach for experts-friendly viral infections modeling. Comput Math Methods Med 2013:8

  14. Wasik S (2018) Modeling biological systems using crowdsourcing. Found Comput Decis Sci 43(3):219–243

  15. Guogas L, Hogle J, Gehrke L (2004) Origins of life and the RNA world: evolution of RNA-replicase recognition. In: Norris R, Stootman F (eds) Bioastronomy 2002: life among the stars. IAU Symposium, vol 213, p 321, June 2004

  16. Brister JR, Ako-adjei D, Bao Y, Blinkova O (2014) NCBI viral genomes resource. Nucleic Acids Res 43(D1):D571–D577

  17. Eigen M, Schuster P (1982) Stages of emerging life – five principles of early organization. J Mol Evol 19(1):47–61

  18. Sharp SJ, Schaack J, Cooley L, Burke DJ, Söll D (1985) Structure and transcription of eukaryotic tRNA genes. CRC Crit Rev Biochem 19(2):107–144

  19. Azad RK, Li J (2013) Interpreting genomic data via entropic dissection. Nucleic Acids Res 41(1):e23

  20. Mohammadi M, Biehl M, Villmann A, Villmann T (2017) Sequence learning in unsupervised and supervised vector quantization using Hankel matrices. In: Rutkowski L, Korytkowski M, Scherer R, Tadeusiewicz R, Zadeh LA, Zurada JM (eds) Proceedings of the 16th international conference on artificial intelligence and soft computing - ICAISC. LNAI, Zakopane. Springer, Cham, pp 131–142

  21. Blaisdell BE (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA 83:5155–5159

  22. Vinga S, Almeida JS (2004) Alignment-free sequence comparison – a review. Bioinformatics 20(2):206–215

  23. Cilibrasi R, Vitányi PMB (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545

  24. Li M, Chen X, Li X, Ma B, Vitányi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264

  25. Kolmogorov AN (1965) Three approaches to the quantitative definition of information. Probl Inf Transm 1(1):1–7

  26. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343

  27. Huffman D (1952) A method for the construction of minimum-redundancy codes. Proc IRE 40(9):1098–1101

  28. Vinga S (2004) Information theory applications for biological sequence analysis. Bioinformatics 15(3):376–389

  29. Vinga S, Almeida JS (2004) Rényi continuous entropy of DNA sequences. J Theor Biol 231:377–388

  30. Fianacca A, LaPaglia L, LaRosa M, LoBosco G, Renda G, Rizzo R, Galio S, Urso A (2018) Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinform 19(Suppl. 7):198

  31. Rényi A (1961) On measures of entropy and information. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley

  32. Nguyen NG, Tran VA, Ngo DL, Phan D, Lumbanraja FR, Faisal MR, Abapihi B, Kubo M, Satou K (2016) DNA sequence classification by convolutional neural network. J Biomed Sci Eng 9:280–286

  33. Hamm J, Lee DD (2008) Grassmann discriminant analysis: a unifying view on subspace-based learning. In: Proceedings of the 25th international conference on machine learning, pp 376–388

  34. Absil P-A, Mahony R, Sepulchre R (2004) Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Appl Math 80:199–220

  35. Wedin PA (1983) On angles between subspaces of a finite dimensional inner product space. Lecture notes in mathematics, vol 973. Springer, Heidelberg, pp 263–285

  36. Nebel D, Kaden M, Villmann A, Villmann T (2017) Types of (dis-)similarities and adaptive mixtures thereof for improved classification learning. Neurocomputing 268:42–54

  37. Kaden M, Riedel M, Hermann W, Villmann T (2015) Border-sensitive learning in generalized learning vector quantization: an alternative to support vector machines. Soft Comput 19(9):2423–2434

  38. Kirby M, Peterson C (2017) Visualizing data sets on the Grassmannian using self-organizing maps. In: Proceedings of the 12th workshop on self-organizing maps and learning vector quantization (WSOM 2017), Nancy, France. IEEE Press, Los Alamitos, pp 32–37

  39. Villmann T (2017) Grassmann manifolds, Hankel matrices and tangent metric models in classification learning. Mach Learn Rep 11(MLR-02-2017):22–25. http://www.techfak.uni-bielefeld.de/~fschleif/mlr/mlr_0_2017.pdf, ISSN:1865-3960

  40. Hammer B, Hofmann D, Schleif F-M, Zhu X (2014) Learning vector quantization for (dis-)similarities. Neurocomputing 131:43–51

  41. Pekalska E, Duin RPW (2006) The dissimilarity representation for pattern recognition: foundations and applications. World Scientific, Singapore

  42. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710

  43. Yin C, Chen Y, Yau SS-T (2014) A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering. J Theor Biol 359:18–28

  44. Almeida JS, Carrico JA, Maretzek A, Noble PA, Fletcher M (2001) Analysis of genomic sequences by chaos game representation. Bioinformatics 17(5):429–437

  45. Deng M, Yu C, Liang Q, He RL, Yau SS-T (2011) A novel method of characterizing sequences: genome space with biological distance and applications. PLoS ONE 6(3):e17293

  46. Li Y, He L, He RL, Yau SS-T (2017) A novel fast vector method for genetic sequence comparison. Nat Sci Rep 7(12226):1–11

  47. Li Y, Tian K, Yin C, He RL, Yau SS-T (2016) Virus classification in 60-dimensional protein space. Mol Phylogenet Evol 99:53–62

Acknowledgement

M.K. was supported by grants from the European Social Fund (ESF) for a Young Researcher Group ‘MACS’ in cooperation with the TU Bergakademie Freiberg (Germany) and for the project titled ‘Digitale Produkt- und Prozessinovationen’ at the UAS Mittweida.

Corresponding author

Correspondence to Thomas Villmann.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Villmann, T. et al. (2020). Searching for the Origins of Life – Detecting RNA Life Signatures Using Learning Vector Quantization. In: Vellido, A., Gibert, K., Angulo, C., Martín Guerrero, J. (eds) Advances in Self-Organizing Maps, Learning Vector Quantization, Clustering and Data Visualization. WSOM 2019. Advances in Intelligent Systems and Computing, vol 976. Springer, Cham. https://doi.org/10.1007/978-3-030-19642-4_32
