Pattern discovery for microsatellite genome analysis

https://doi.org/10.1016/j.compbiomed.2014.01.002

Abstract

Microsatellite loci comprise an important part of eukaryotic genomes. Their applications in biology as genetic markers span numerous fields, ranging from paternity analyses to construction of genetic maps and linkage to human disease. Existing software solutions that offer pattern discovery algorithms for the correct identification and downstream analysis of microsatellites are scarce and are proving inefficient for analyzing the large, exponentially growing body of sequenced genomes. Moreover, such analyses can be very difficult for bioinformatically inexperienced biologists. In this paper we present Microsatellite Genome Analysis (MiGA) software for the detection of all microsatellite loci in genomic data through a user-friendly interface. The algorithm searches exhaustively and rapidly for most microsatellites. Unlike other applications, MiGA takes into consideration the following three most important aspects: the efficiency of the algorithm, the usability of the software and the plethora of offered summary statistics. Together, these help biologists to obtain basic quantitative and qualitative information regarding the presence of microsatellites in genomic data, and support downstream processes such as selection of specific microsatellite loci for primer design and comparative genome analysis.

Introduction

Microsatellites (or simple sequence repeats, SSRs) constitute one of the most important classes of genetic markers, widely applied in an array of research areas, such as studies of genetic variation and structure, or construction of genetic maps [1]. Some of the best-known applications in which microsatellites play a key role are paternity testing [2], [3], the confirmation of family pedigrees [4] and forensic investigations [5]. The ubiquity of microsatellites within genomes has played an important role in many genetic mapping projects, such as those for human [6], mouse [7], dog [8], trout [9] and other species.

Recently, the advent of next-generation sequencing platforms has produced a wealth of genomic data, permitting a more in-depth analysis of microsatellite genome abundance and distribution across different organisms. In this paper we propose a new algorithm for the detection of microsatellite loci in genomic data. The algorithm searches exhaustively for mono-, di-, tri-, tetra-, penta- and hexa-nucleotide microsatellites and for perfect, imperfect, perfect compound and imperfect compound microsatellites simultaneously in one execution. The algorithm has been implemented and is offered in the user-friendly application Microsatellite Genome Analysis (MiGA). The MiGA application, the user manual and example datasets are available at http://mlkd.csd.auth.gr/bio/miga/index.html.

The paper is organized as follows. Section 2 provides some necessary background knowledge. Section 3 is dedicated to the detailed description of our approach. This includes the description of the algorithm, the data repository, the front end and the results. Section 4 presents related work and the existing tools. Section 5 presents results of a task-oriented user evaluation. In Section 6, as an example, we present a full microsatellite analysis of the genome of Danio rerio (an important vertebrate model organism). The paper is concluded in Section 7.

Section snippets

Repeated sequences

The entire genome of an organism contains non-coding regions as well as coding regions, which are translated into proteins. Large parts of non-coding DNA are organized in repeated sequences. These sequences appear in various sizes and in multiple copies in the genome, and it was initially believed that they had no particular role in biological processes. Today, it is accepted that they play a significant role in the structure, the function and the evolution of genomes and can interact with gene

Pattern discovery algorithm for SSR extraction

MiGA's algorithm uses an exhaustive search to identify microsatellites in genomes. The algorithm searches for SSRs in a character string (sequence). It finds the first pattern (1–6 characters) that is repeated and continues by calling itself, with the remaining substring as input, to identify all the patterns and the number of their repeats. This design was necessary in order to achieve better execution time, which is essential for complete genome analysis since complete genomes are big
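The search described above can be illustrated with a minimal Python sketch. This is not MiGA's implementation: the function name `find_ssrs`, the `min_len` parameter and the restriction to perfect SSRs are assumptions made for illustration only. The sketch scans for the first motif of 1 to 6 characters whose tandem copies reach a minimum tract length, records it, and then calls itself on the remaining substring.

```python
def find_ssrs(seq, min_len=12, max_motif=6, offset=0):
    """Return (start, motif, repeats) tuples for perfect SSRs in seq.

    Illustrative only: names and defaults are assumptions, not MiGA's API.
    """
    for i in range(len(seq)):
        for k in range(1, max_motif + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                break  # not enough characters left for a motif of this size
            # Count consecutive tandem copies of the motif starting at i.
            repeats = 1
            while seq[i + repeats * k:i + (repeats + 1) * k] == motif:
                repeats += 1
            if repeats >= 2 and repeats * k >= min_len:
                end = i + repeats * k
                hit = (offset + i, motif, repeats)
                # Recurse on the remaining substring, as in the description.
                return [hit] + find_ssrs(seq[end:], min_len, max_motif, offset + end)
    return []


if __name__ == "__main__":
    example = "TT" + "AC" * 6 + "GG" + "CAT" * 4 + "TT"
    for start, motif, repeats in find_ssrs(example):
        print(start, motif, repeats)
```

Because motif sizes are tried from shortest to longest, a tract such as ACACACACACAC is reported with the motif AC rather than ACAC. The recursion depth grows with the number of hits, so a whole-genome run in Python would need an iterative loop instead; the sketch is meant only for short test strings.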

Related work

Microsatellites have wide applications in the field of biology. Their analysis is therefore a research area with many branches, especially now that genome information is increasingly available. For instance, LobSTR [17] and RepeatSeq [18] are new applications that aim to infer diploid genotypes in microsatellite loci after full genome sequencing of individuals, based on pre-existing reference genomes. Another new branch in microsatellite analysis is to trace microsatellite loci within

User evaluation

Biologists are a very diverse group in terms of familiarity with computers. Within the scientific community, the success of a bioinformatics application can be judged by the simplicity and ease with which tasks are accomplished. It is common sense that one of the most important aspects in software design is usability. Usability can be measured in various ways, but one of the most reliable is through a task-oriented evaluation, which is performed by ordinary users who had no

Results of full genome analysis of Danio rerio

Zebrafish (D. rerio) is a tropical freshwater fish and a model organism in biology [31], [33], [34]. Its sequencing project was started in 2001 by the Sanger Institute, and several assemblies have been released since then. Its full genome was published in 2013 [32]. The zebrafish genome assembly analyzed here was 1,357,051,643 base pairs (bp) long. The parameters used for the microsatellite analyses, for each type, were: (i) perfect SSRs: Minimum SSR length (bp)=12, (ii)
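To make the perfect-SSR setting quoted above more concrete, the following sketch converts a minimum SSR length of 12 bp into the smallest number of whole repeats needed for each motif class. It assumes that the threshold applies to the total tract length and that only complete repeat units are counted; how MiGA itself applies the parameter is not stated in this excerpt, and the names `MOTIF_CLASSES`, `min_repeats` and `thresholds` are hypothetical.

```python
import math

# Motif length (bp) mapped to the usual class name.
MOTIF_CLASSES = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def min_repeats(min_len_bp, motif_len):
    """Smallest whole number of repeats whose tract reaches min_len_bp."""
    return math.ceil(min_len_bp / motif_len)


def thresholds(min_len_bp=12):
    """Minimum repeat count per motif class for a given length threshold."""
    return {name: min_repeats(min_len_bp, k) for k, name in MOTIF_CLASSES.items()}


if __name__ == "__main__":
    for name, reps in thresholds(12).items():
        print(f"{name}: at least {reps} repeats")
```

Under these assumptions, a 12 bp threshold corresponds to 12 mononucleotide repeats, 6 dinucleotide repeats, 4 trinucleotide repeats, 3 tetra- or pentanucleotide repeats, and 2 hexanucleotide repeats.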

Summary and future work

Many programs now exist that can detect microsatellites in the genome, and some of them are quite successful. However, their drawbacks are numerous. MiGA manages to solve all these problems and, moreover, provides functions that have never been offered before. In the future, we plan to further expand the application with new functions, such as remote access to the database, in order to make best use of available computer lab resources. MiGA has some features, i.e., data repository and

Conflict of interest statement

None declared.

Acknowledgments

The authors would like to thank Anestis Fachantidis and Efstratios Kontopoulos for their valuable comments and suggestions.

References (38)

  • P. Werner et al., Anchoring of canine linkage groups with chromosome-specific markers, Mamm. Genome (1999)
  • R. Guyomard et al., A type I and type II microsatellite linkage map of rainbow trout (Oncorhynchus mykiss) with presumptive coverage of all chromosome arms, BMC Genomics (2006)
  • P. Flicek et al., Ensembl, Nucleic Acids Res. (2013)
  • G.J. Faulkner et al., Altruistic functions for selfish DNA, Cell Cycle (2009)
  • C. Biemont, A brief history of the status of transposable elements: from junk DNA to major players in evolution, Genetics (2010)
  • B.E. Bernstein et al., An integrated encyclopedia of DNA elements in the human genome, Nature (2012)
  • Initial sequencing and analysis of the human genome, Nature (2001)
  • M.V. Katti et al., Differential distribution of simple sequence repeats in eukaryotic genome sequences, Mol. Biol. Evol. (2001)
  • R. Kofler et al., SciRoKo: a new tool for whole genome microsatellite search and investigation, Bioinformatics (2007)