Pattern discovery for microsatellite genome analysis
Introduction
Microsatellites (or simple sequence repeats, SSRs) constitute one of the most important classes of genetic markers and are widely applied in research areas such as studies of genetic variation and structure or the construction of genetic maps [1]. Some of the best-known applications in which microsatellites play a key role are paternity testing [2], [3], the confirmation of family pedigrees [4] and forensic investigations [5]. The ubiquity of microsatellites within genomes has played an important role in many genetic mapping projects, such as those for human [6], mouse [7], dog [8], trout [9] and other species.
Recently, the advent of next-generation sequencing platforms has produced a wealth of genomic data, permitting a more in-depth analysis of microsatellite abundance and distribution in the genomes of different organisms. In this paper we propose a new algorithm for the detection of microsatellite loci in genomic data. The algorithm searches exhaustively for mono-, di-, tri-, tetra-, penta- and hexa-nucleotide microsatellites and for perfect, imperfect, perfect compound and imperfect compound microsatellites simultaneously in one execution. The algorithm has been implemented and is offered in the user-friendly application Microsatellite Genome Analysis (MiGA). The MiGA application, the user manual and example datasets are available at http://mlkd.csd.auth.gr/bio/miga/index.html.
The paper is organized as follows. Section 2 provides some necessary background knowledge. Section 3 is dedicated to the detailed description of our approach, including the algorithm, the data repository, the front end and the results. Section 4 presents related work and the existing tools. Section 5 presents the results of a task-oriented user evaluation. In Section 6, as an example, we present a full microsatellite analysis of the genome of Danio rerio (an important vertebrate model organism). Section 7 concludes the paper.
Section snippets
Repeated sequences
The entire genome of an organism contains non-coding regions as well as coding regions, which are translated into proteins. Large parts of non-coding DNA are organized in repeated sequences. These sequences appear in various sizes and in multiple copies in the genome, and it was initially believed that they had no particular role in biological processes. Today, it is accepted that they play a significant role in the structure, the function and the evolution of genomes and can interact with gene
Pattern discovery algorithm for SSR extraction
MiGA's algorithm uses an exhaustive search to identify microsatellites in genomes. The algorithm searches for SSRs in a character string (sequence). It finds the first pattern (1–6 characters) that is repeated and continues by calling itself, with the remaining substring as input, to identify all the patterns and the number of their repeats. This design was necessary to achieve better execution time, which is essential for complete genome analysis since complete genomes are big
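The scanning idea described above can be illustrated with a minimal sketch. It is not MiGA's implementation: it handles perfect SSRs only, it replaces the recursive call on the remaining substring with an equivalent loop that advances an index, and the names `Repeat`, `find_perfect_ssrs`, `min_length` and `max_motif` are hypothetical. The default minimum length of 12 bp mirrors the perfect-SSR parameter reported for the Danio rerio analysis; the tie-breaking rule (prefer the longest span, then the shortest motif) is an assumption made for this example.

```python
from typing import List, NamedTuple


class Repeat(NamedTuple):
    start: int   # 0-based offset of the first repeated unit
    motif: str   # repeated unit, 1 to max_motif characters
    copies: int  # number of consecutive copies of the motif


def find_perfect_ssrs(seq: str, min_length: int = 12, max_motif: int = 6) -> List[Repeat]:
    """Scan seq left to right and report perfect tandem repeats (illustrative only)."""
    seq = seq.upper()
    found: List[Repeat] = []
    i = 0
    while i < len(seq):
        best = None  # (span in bp, Repeat) for the best candidate starting at i
        for k in range(1, max_motif + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                break  # fewer than k characters remain
            copies = 1
            # Count how many times the motif repeats back to back.
            while seq.startswith(motif, i + copies * k):
                copies += 1
            span = copies * k
            # Strict '>' keeps the shortest motif when spans are equal.
            if copies >= 2 and span >= min_length and (best is None or span > best[0]):
                best = (span, Repeat(i, motif, copies))
        if best is None:
            i += 1          # no repeat starts here; move one base forward
        else:
            found.append(best[1])
            i += best[0]    # continue with the remaining substring
    return found


if __name__ == "__main__":
    example = "GGT" + "AC" * 7 + "TTG" + "A" * 12 + "C"
    for hit in find_perfect_ssrs(example):
        print(hit)
```

Advancing an index instead of recursing avoids Python's recursion limit on long sequences; imperfect and compound microsatellites would need additional logic that this sketch does not attempt.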
Related work
Microsatellites have wide applications in biology, so their analysis is a research area with many branches, especially now that genome information is increasingly available. For instance, LobSTR [17] and RepeatSeq [18] are recent applications that aim to infer diploid genotypes at microsatellite loci after full genome sequencing of individuals, based on pre-existing reference genomes. Another new branch in microsatellite analysis is to trace microsatellite loci within
User evaluation
Biologists are a very diverse group in terms of how conversant they are with computers. Within the scientific community, the success of a bioinformatics application can be judged by how simply and easily tasks are accomplished with it. Usability is one of the most important aspects of software design. It can be measured in various ways, but one of the most reliable is a task-oriented evaluation, which is performed by ordinary users who had no
Results of full genome analysis of Danio rerio
Zebrafish (D. rerio) is a tropical freshwater fish and a model organism in biology [31], [33], [34]. Its sequencing project was started in 2001 by the Sanger Institute, and several assemblies have been released since then. Its full genome was published in 2013 [32]. The assembly of the zebrafish genome that was analyzed comprised 1,357,051,643 base pairs (bp). The parameters used for the microsatellite analyses, for each type, were: (i) perfect SSRs: Minimum SSR length (bp)=12, (ii)
Summary and future work
Many programs now exist that can detect microsatellites in the genome, and some of them are quite successful. However, their drawbacks are numerous. MiGA manages to solve all these problems and, moreover, provides functions that have never been offered before. In the future, we plan to further expand the application with new functions, such as remote access to the database, in order to make the best use of available computer lab resources. MiGA has some features, i.e. data repository and
Conflict of interest statement
None declared.
Acknowledgments
The authors would like to thank Anestis Fachantidis and Efstratios Kontopoulos for their valuable comments and suggestions.
References (38)
- et al. Microsatellite stability in human post-mortem tissues. Forensic Sci. Int. (2001)
- et al. An exhaustive DNA microsatellite map of the human genome using high performance computing. Genomics (2003)
- et al. Zebrafish heart regeneration as a model for cardiac tissue repair. Drug Discovery Today: Dis. Models (2007)
- et al. The identification and characterization of microsatellites in the compact genome of the Japanese pufferfish, Fugu rubripes: perspectives in functional and comparative genomic analyses. J. Mol. Biol. (1998)
- Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. (2004)
- et al. Parentage testing in alpacas (Vicugna pacos) using semi-automated fluorescent multiplex PCRs with 10 microsatellite markers. Anim. Genet. (2008)
- et al. DNA microsatellites in domesticated dogs: application in paternity disputes. Pflugers Arch. (1996)
- et al. Pedigree analysis of the Sika deer (Cervus nippon) using microsatellite markers. Zool. Sci. (2000)
- et al. A comparative approach to physical and linkage mapping of genes on canine chromosomes using gene-associated simple sequence repeat polymorphisms illustrated by studies of dog chromosome 9. J. Hered. (1999)
- et al. A high-resolution microsatellite map of the mouse genome. Genome Res. (1998)