Abstract
The most plausible hypothesis for explaining the origins of life on Earth is the RNA world hypothesis, which is supported by a growing number of research results from various scientific areas. Frequently, the existence of a hypothetical species on Earth is assumed whose base RNA sequence is probably dissimilar from any genome known today. It is hard to distinguish hypothetical sequences obtained by computer simulations from biological sequences and, hence, to decide which characteristics provide biological functionality. In this contribution, biological sequences obtained from RNA viruses are compared with computationally generated sequences (artificial life probes). The task is to discriminate the samples by their origin, biological or artificial. We use learning vector quantization (LVQ) as the classifier. LVQ is a dissimilarity-based classifier that imposes only weak requirements on the underlying dissimilarity measure. This makes it possible to investigate how well several dissimilarity measures discriminate in this task. In particular, we consider information-theoretic dissimilarities such as the normalized compression distance (NCD) and divergences based on bag-of-words (BoW) vectors generated from nucleotide codons. Additionally, the geodesic path distance (GPD) is applied, using a unary coding of sequences for a representation in the underlying Grassmann manifold. Both BoW and GPD allow continuous updates of prototypes, in the feature space and in the Grassmann manifold, respectively, whereas NCD restricts the application of LVQ methods to median variants.
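As a rough illustration of two of the dissimilarities named in the abstract, the following Python sketch computes an NCD and a divergence between codon BoW vectors. Several choices here are assumptions and are not taken from the paper: the compressor (`zlib`), the RNA alphabet `ACGU`, non-overlapping codons read in frame 0, the additive pseudocount, and the Kullback-Leibler divergence as the example divergence. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter
from itertools import product


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# All 64 codons over the assumed RNA alphabet, in a fixed order.
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]


def codon_bow(seq: str, pseudocount: float = 1.0) -> list:
    """Relative codon frequencies (bag-of-words vector of length 64).

    Codons are read without overlap starting at position 0 (assumption).
    Triplets containing other symbols are ignored. The pseudocount keeps
    every entry positive so that a divergence stays finite.
    """
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return [(counts[c] + pseudocount) / total for c in CODONS]


def kl_divergence(p: list, q: list) -> float:
    """Kullback-Leibler divergence between two strictly positive vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A call such as `ncd(seq_a, seq_b)` needs only the raw strings, so it yields a dissimilarity value but no vector representation that could be adapted gradually; this is consistent with the abstract's remark that NCD restricts LVQ to median variants. The BoW vectors, in contrast, live in a 64-dimensional feature space.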
Notes
1. For computational convenience, it is usually assumed that both matrices \(\mathbf{X}\) and \(\mathbf{Y}\) are orthonormal, which can always be achieved by Gram-Schmidt orthonormalization. We adopt this assumption here as well. If it is dropped, the procedure remains valid but becomes more complicated; see [34].
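The orthonormality assumption in this note can be illustrated with a short NumPy sketch. It orthonormalizes two basis matrices and computes a geodesic distance between the subspaces they span from the principal angles. Two points are assumptions rather than statements from the paper: a reduced QR decomposition is used in place of explicit Gram-Schmidt (it spans the same subspace for full-column-rank input), and the distance is taken as the Euclidean norm of the principal angles, one common definition on the Grassmann manifold. The function names are hypothetical.

```python
import numpy as np


def orthonormal_basis(a: np.ndarray) -> np.ndarray:
    """Orthonormal basis for the column space of a full-column-rank matrix.

    Reduced QR is used here instead of explicit Gram-Schmidt (assumption).
    """
    q, _ = np.linalg.qr(a)
    return q


def geodesic_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Norm of the principal angles between span(x) and span(y)."""
    qx, qy = orthonormal_basis(x), orthonormal_basis(y)
    # Singular values of Qx^T Qy are the cosines of the principal angles.
    cosines = np.linalg.svd(qx.T @ qy, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(angles))
```

Because only the spanned subspace matters, two different basis matrices for the same subspace give a distance of (numerically) zero.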
References
Gilbert W (1986) Origin of life: the RNA world. Nature 319(6055):618
Neveu M, Kim H-J, Benner SA (2013) The “Strong” RNA world hypothesis: fifty years old. Astrobiology 13(4):391–403
Rich A (1962) On the problems of evolution and biochemical information transfer. In: Kasha M, Pullman B (eds) Horizons in biochemistry. Academic Press, pp 103–126
Cech TR (2011) The RNA worlds in context. Cold Spring Harb Perspect Biol 4(7):a006742
Wasik S, Szostak N, Kudla M, Wachowiak M, Krawiec K, Blazewicz J (2019) Detecting life signatures with RNA sequence similarity measure. J Theor Biol 463:110–120
Szostak N, Synak J, Borowski M, Wasik S, Blazewicz J (2017) Simulating the origins of life: the dual role of RNA replicases as an obstacle to evolution. PLOS ONE 12(7):1–28
Eigen M (1971) Selforganization of matter and the evolution of biological macromolecules. Die Naturwiss 58(10):465–523
Quastler H (1953) Essays on the use of information theory in biology. University of Illinois Press, Urbana
Szostak N, Wasik S, Blazewicz J (2017) Understanding life: a bioinformatics perspective. Eur Rev 25(2):231–245
Kohonen T (1988) Learning vector quantization. Neural Netw 1(Suppl. 1):303
Sato A, Yamada K (1996) Generalized learning vector quantization. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Advances in neural information processing systems, vol 8. Proceedings of the 1995 Conference. MIT Press, Cambridge, pp 423–429
Nebel D, Hammer B, Frohberg K, Villmann T (2015) Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing 169:295–305
Wasik S, Prejzendanc T, Blazewicz J (2013) ModeLang - a new approach for experts-friendly viral infections modeling. Comput Math Methods Med 2013:8
Wasik S (2018) Modeling biological systems using crowdsourcing. Found Comput Decis Sci 43(3):219–243
Guogas L, Hogle J, Gehrke L (2004) Origins of life and the RNA world: evolution of RNA-replicase recognition. In: Norris R, Stootman F (eds) Bioastronomy 2002: life among the stars. IAU Symposium, vol 213, p 321, June 2004
Brister JR, Ako-adjei D, Bao Y, Blinkova O (2014) NCBI viral genomes resource. Nucleic Acids Res 43(D1):D571–D577
Eigen M, Schuster P (1982) Stages of emerging life—five principles of early organization. J Mol Evol 19(1):47–61
Sharp SJ, Schaack J, Cooley L, Burke DJ, Söll D (1985) Structure and transcription of eukaryotic tRNA genes. CRC Crit Rev Biochem 19(2):107–144
Azad RK, Li J (2013) Interpreting genomic data via entropic dissection. Nucleic Acids Res 41(1):e23
Mohammadi M, Biehl M, Villmann A, Villmann T (2017) Sequence learning in unsupervised and supervised vector quantization using Hankel matrices. In: Rutkowski L, Korytkowski M, Scherer R, Tadeusiewicz R, Zadeh LA, Zurada JM (eds) Proceedings of the 16th international conference on artificial intelligence and soft computing - ICAISC. LNAI, Zakopane. Springer, Cham, pp 131–142
Blaisdell BE (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA 83:5155–5159
Vinga S, Almeida JS (2004) Alignment-free sequence comparison – a review. Bioinformatics 20(2):206–215
Cilibrasi R, Vitányi PMB (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545
Li M, Chen X, Li X, Ma B, Vitanyi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264
Kolmogorov AN (1965) Three approaches to the quantitative definition of information. Probl Inf Transm 1(1):1–7
Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343
Huffman D (1952) A method for the construction of minimum-redundancy codes. Proc IRE 40(9):1098–1101
Vinga S (2004) Information theory applications for biological sequence analysis. Bioinformatics 15(3):376–389
Vinga S, Almeida JS (2004) Rényi continuous entropy of DNA sequences. J Theor Biol 231:377–388
Fiannaca A, La Paglia L, La Rosa M, Lo Bosco G, Renda G, Rizzo R, Gaglio S, Urso A (2018) Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinform 19(Suppl. 7):198
Rényi A (1961) On measures of entropy and information. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley
Nguyen NG, Tran VA, Ngo DL, Phan D, Lumbanraja FR, Faisal MR, Abapihi B, Kubo M, Satou K (2016) DNA sequence classification by convolutional neural network. J Biomed Sci Eng 9:280–286
Hamm J, Lee DD (2008) Grassmann discriminant analysis: a unifying view on subspace-based learning. In: Proceedings of the 25th international conference on machine learning, pp 376–388
Absil P-A, Mahony R, Sepulchre R (2004) Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Appl Math 80:199–220
Wedin PA (1983) On angles between subspaces of a finite dimensional inner product space. Lecture notes in mathematics, vol 973. Springer, Heidelberg, pp 263–285
Nebel D, Kaden M, Villmann A, Villmann T (2017) Types of (dis-)similarities and adaptive mixtures thereof for improved classification learning. Neurocomputing 268:42–54
Kaden M, Riedel M, Hermann W, Villmann T (2015) Border-sensitive learning in generalized learning vector quantization: an alternative to support vector machines. Soft Comput 19(9):2423–2434
Kirby M, Peterson C (2017) Visualizing data sets on the Grassmannian using self-organizing maps. In: Proceedings of the 12th workshop on self-organizing maps and learning vector quantization (WSOM 2017), Nancy, France. IEEE Press, Los Alamitos, pp 32–37
Villmann T (2017) Grassmann manifolds, Hankel matrices and tangent metric models in classification learning. Mach Learn Rep 11(MLR-02-2017):22–25 http://www.techfak.uni-bielefeld.de/~fschleif/mlr/mlr_0_2017.pdf, ISSN:1865-3960
Hammer B, Hofmann D, Schleif F-M, Zhu X (2014) Learning vector quantization for (dis-)similarities. Neurocomputing 131:43–51
Pekalska E, Duin RPW (2006) The dissimilarity representation for pattern recognition: foundations and applications. World Scientific, Singapore
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710
Yin C, Chen Y, Yau SS-T (2014) A measure of DNA sequence similarity by fourier transform with applications on hierarchical clustering. J Theor Biol 359:18–28
Almeida JS, Carrico JA, Maretzek A, Noble PA, Fletcher M (2001) Analysis of genomic sequences by chaos game representation. Bioinformatics 17(5):429–437
Deng M, Yu C, Liang Q, He RL, Yau SS-T (2011) A novel method of characterizing sequences: genome space with biological distance and applications. PLoS ONE 6(3):e17293
Li Y, He L, He RL, Yau SS-T (2017) A novel fast vector method for genetic sequence comparison. Nat Sci Rep 7(12226):1–11
Li Y, Tian K, Yin C, He RL, Yau SS-T (2016) Virus classification in 60-dimensional protein space. Mol Phylogenet Evol 99:53–62
Acknowledgement
M.K. was supported by grants from the European Social Fund (ESF) for the Young Researcher Group ‘MACS’ in cooperation with the TU Bergakademie Freiberg (Germany) and for the project ‘Digitale Produkt- und Prozessinnovationen’ at the UAS Mittweida.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Villmann, T. et al. (2020). Searching for the Origins of Life – Detecting RNA Life Signatures Using Learning Vector Quantization. In: Vellido, A., Gibert, K., Angulo, C., Martín Guerrero, J. (eds) Advances in Self-Organizing Maps, Learning Vector Quantization, Clustering and Data Visualization. WSOM 2019. Advances in Intelligent Systems and Computing, vol 976. Springer, Cham. https://doi.org/10.1007/978-3-030-19642-4_32
DOI: https://doi.org/10.1007/978-3-030-19642-4_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19641-7
Online ISBN: 978-3-030-19642-4
eBook Packages: Intelligent Technologies and Robotics (R0)