Abstract
Motivated by an interest in understanding how information is organized within genomes, and how genes communicate with each other in the transcription process, we propose in this paper a novel network-based methodology for genomic sequence analysis, applied specifically to three organisms: Nanoarchaeum equitans, Escherichia coli, and Saccharomyces cerevisiae. A previously introduced dictionary-based approach is continued here through a repeat analysis in genic and intergenic regions. The key results of this work come from a biological and computational analysis of novel parametrized gene networks, defined by means of motifs of fixed length occurring inside multiple genes. Cliques emerge as groups of genes sharing a long repeat with a clear biological interpretation, while a (complete, paralog) cluster analysis has revealed some unexpected regularity. Repeat sharing gene networks may be applied in comparative genomics, as an investigation methodology for understanding the evolutionary and functional properties of genes.
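As a rough illustration of the network definition given in the abstract, the following Python sketch links two genes whenever they share at least one motif of fixed length k. It is not the authors' implementation: the function names, the toy sequences, and the choice to label each edge with the set of shared motifs are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all substrings of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def repeat_sharing_network(genes, k):
    """Build a toy repeat sharing network (hypothetical helper).

    genes: dict mapping a gene name to its nucleotide sequence.
    Returns a dict mapping each gene pair (a, b), with a < b, to the
    set of length-k motifs that occur inside both genes.
    """
    # Index: motif -> names of the genes that contain it.
    motif_to_genes = defaultdict(set)
    for name, seq in genes.items():
        for motif in kmers(seq, k):
            motif_to_genes[motif].add(name)

    # Every pair of genes containing the same motif gets an edge.
    edges = defaultdict(set)
    for motif, names in motif_to_genes.items():
        for a, b in combinations(sorted(names), 2):
            edges[(a, b)].add(motif)
    return dict(edges)


# Toy data, invented for illustration only.
toy_genes = {
    "geneA": "ACGTACGGA",
    "geneB": "TTACGGATT",
    "geneC": "GGGGCCCC",
}
network = repeat_sharing_network(toy_genes, 5)
print(network)
```

In this construction, all genes containing the same motif are pairwise connected, so each shared motif induces a clique. This is consistent with the abstract's remark that cliques emerge as groups of genes sharing a long repeat.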
Notes
www.cbmc.it/external/Infogenomics3.
For example, the capability of a protein to break chemical bonds or to phosphorylate another protein.
For example, a protein involved in replication, energy production, or movement.
Localization of the protein, for example in the nucleus, on membranes, or in ribosomes.
Acknowledgments
The first author was financially supported by CBMC (Center for Biomedical Computing) in Verona, Italy, which also provided the server on which all the computations were performed. All the authors are grateful for the numerous and detailed improvements suggested by anonymous referees, and for inspiring discussions on the Infogenomic approach with V. Manca.
Cite this article
Castellini, A., Franco, G. & Milanese, A. A genome analysis based on repeat sharing gene networks. Nat Comput 14, 403–420 (2015). https://doi.org/10.1007/s11047-014-9437-6
DOI: https://doi.org/10.1007/s11047-014-9437-6