Visual software tools for bioinformatics

https://doi.org/10.1016/j.jvlc.2007.06.001

Abstract

Bioinformatics is the application of techniques from computer science, statistics and mathematics to problems in molecular biology. This interdisciplinary approach is rapidly revolutionizing biology. A survey of software tools for bioinformatics is presented, with special emphasis on the visual aspects of these tools. The most important visualization tasks in bioinformatics are visualizing sequence data and visualizing protein structures. The visualization of interactions between molecules in a metabolic pathway or network is an emerging area. Many important visualization techniques have yet to be applied in this application area.

Introduction

Bioinformatics has been defined as the application of information technology (computer science, mathematics and statistics) to the management of biological information. In particular, bioinformatics has been widely associated with molecular biology, which is largely concerned with the study of three types of molecules: DNA, RNA and protein. The central dogma of molecular biology describes how the information stored in DNA is transcribed into RNA and then translated into protein. Each of these three molecules is a polymer, that is, a string of simpler units: nucleotides in the case of DNA and RNA, amino acids in the case of protein. Each DNA nucleotide contains one of four bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). Uracil (U) takes the place of thymine in RNA. There are 20 standard, naturally occurring amino acids. Each amino acid can be specified by either a three-letter or a one-letter code. For example, tryptophan is specified by either Trp or W. The one-letter code is more compact and therefore better suited to computer processing.
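The flow of information described above can be illustrated with a short Python sketch. It is a minimal illustration rather than a production tool: it assumes the input is the coding (sense) strand of the DNA, uses the standard genetic code, and the function names `transcribe` and `translate` are chosen here for illustration only.

```python
from itertools import product

# Standard genetic code with codons enumerated in U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def transcribe(dna):
    """Return the RNA for a DNA coding strand (assumption: sense strand given)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGTGGTAA")
print(rna, translate(rna))
```

Here the toy sequence ATGTGGTAA encodes methionine, tryptophan and a stop codon, so the protein string in one-letter code is two characters long.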

The DNA molecule has the famous double helix structure in which each base from one strand of the double helix pairs with a base from the other strand. An A base pairs only with a T base, while a C base pairs only with a G base. Because of this, given the sequence of one of the strands, we can infer the sequence of the complementary strand. Thus, a DNA molecule can be specified by giving the sequence of one of the strands, for example, AAACGTC etc. The story is a bit more complex for RNA and proteins, which are single stranded. We can still usefully characterize the molecule by giving the sequence data (a string of bases for RNA, a string of amino acids for protein). However, this does not completely characterize the molecules, since they can fold into irregular shapes that are functionally important. For these molecules, the sequence data is referred to as the primary structure, while the secondary structure is the three-dimensional form of local segments of the polymer. Typical local structures for proteins are alpha helices and beta sheets, while the stem-loop is a typical RNA secondary structure. The tertiary structure of a protein is its three-dimensional structure given by the atomic coordinates, while quaternary structure is the arrangement of multiple folded proteins in a protein complex. Fig. 1 below shows the secondary structure of the myoglobin protein, which contains several alpha helices and random coils, but no beta sheets. The visual representations of the two secondary structures are typical.
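The base-pairing rule makes it straightforward to derive one strand from the other. The following Python sketch applies the A/T and C/G pairing to the example sequence from the text. The helper names are illustrative, and the reversed form is included because the two strands of the helix run in opposite directions, so the partner strand is conventionally written as the reverse complement.

```python
# Pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand):
    """Return the complementary strand written in its own reading direction."""
    return complement(strand)[::-1]


print(complement("AAACGTC"))          # TTTGCAG
print(reverse_complement("AAACGTC"))  # GACGTTT
```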

One of the most important tasks of a bioinformatics tool is to perform sequence alignment. Given two different but related sequences (of possibly different lengths), the tool attempts to find the best match between them. In Fig. 2, the alignment between two zinc finger proteins produced by the freely available ClustalW program is shown. Notice that amino acids that have similar chemical properties are shown with the same color. The third row below the two sequences being aligned gives information about the goodness of the match in each column—a “*” symbol means that the two amino acids are identical, a “:” represents a conserved substitution (substitution of a similar amino acid), while a “.” represents a semi-conserved substitution.
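To make the idea of finding the best match concrete, the sketch below implements a classic global alignment by dynamic programming (the Needleman-Wunsch approach). It is not the algorithm ClustalW itself uses, and the scores for match, mismatch and gap are arbitrary illustrative values. The marker line only flags identical columns with "*". Reproducing the ":" and "." markers would require a table of similar amino acid groups, which is not attempted here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def identity_line(aligned_a, aligned_b):
    """Mark columns where both sequences carry the same residue with '*'."""
    return "".join(
        "*" if x == y and x != "-" else " " for x, y in zip(aligned_a, aligned_b)
    )


top, bottom, total = needleman_wunsch("ACGT", "AGT")
print(top)
print(bottom)
print(identity_line(top, bottom))
```

For the toy input, the shorter sequence receives a single gap opposite the C, and the three remaining columns are marked as identical.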

When trying to deduce the evolutionary history of several organisms or genes, it is necessary to align the sequence data from each of the organisms/genes using a process called multiple sequence alignment. Fig. 3 shows a multiple sequence alignment produced by ClustalW for several instances of a particular protein from several different organisms. Note once again how color is used to help in the interpretation of the result.

Usually, a multiple sequence alignment results in a consensus sequence that shows the base or amino acid which occurs most often in each column of the alignment. An alternative way to view sequence alignments is with the sequence logo format that was developed by Tom Schneider at the National Cancer Institute. This method shows more information about the alignment, such as whether more than one base or amino acid occurs in each column, and the relative frequency of occurrence in the column. The sequence logo shows the frequencies of bases in each column as the relative height of the letter representing the base, along with the degree of sequence conservation as the total height of a stack of letters, measured in bits of information [1]. An example is shown in Fig. 4. The sequence logo shown is generated using the DELILA programs.
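The quantities behind a consensus sequence and a sequence logo can be sketched in a few lines of Python. This is a simplified illustration, not the DELILA implementation: it assumes a gap-free DNA alignment, takes the information in a column to be log2(4) minus the Shannon entropy of the observed base frequencies, and omits the small-sample correction described in [1]. The alignment itself is made-up toy data.

```python
import math
from collections import Counter


def column_counts(alignment):
    """Count the symbols in each column of an equal-length alignment."""
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment):
    """Most frequent symbol per column (ties go to the first symbol seen)."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))


def information_bits(alignment, alphabet_size=4):
    """Stack height per column: log2(alphabet size) minus Shannon entropy."""
    bits = []
    for counts in column_counts(alignment):
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        bits.append(math.log2(alphabet_size) - entropy)
    return bits


def letter_heights(alignment, alphabet_size=4):
    """Height of each letter: its frequency times the column's stack height."""
    heights = []
    stacks = information_bits(alignment, alphabet_size)
    for counts, stack in zip(column_counts(alignment), stacks):
        total = sum(counts.values())
        heights.append({base: (n / total) * stack for base, n in counts.items()})
    return heights


toy_alignment = ["ACGT", "ACGA", "ACGA", "AGTA"]  # hypothetical sequences
print(consensus(toy_alignment))
print([round(b, 3) for b in information_bits(toy_alignment)])
```

A fully conserved column reaches the maximum of 2 bits for DNA, while columns with mixed bases produce shorter stacks in which each letter's height is proportional to its frequency.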

A related visualization technique developed by the same author is the sequence walker [2]. A sequence walker displays information about a single sequence of a multiple sequence alignment. The height of letters in the graphic representation indicates how much the base matches the consensus value at each position. Bases that have a positive match value are shown right side up while bases that have negative values are shown upside down and below the “horizon”. Bases that do not appear in the set of aligned sequences are shown negatively and in a black box. The zero coordinate (a position by which a set of binding sites—the place on a molecule that a protein binds to—is aligned) is inside a rectangle that has a light green background if the sequence has been evaluated as a binding site, and a pink background otherwise. An example of a sequence walker is shown in Fig. 5.
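The letter heights of a walker can be approximated with a short Python sketch. It rests on an assumption that should be checked against [2]: the weight of base b at position l is taken to be 2 + log2 f(b, l), where f(b, l) is the frequency of b at that position in the aligned set of sites, with no small-sample correction. Under this assumption, frequent bases get positive heights, rare bases get negative heights, and bases never seen at a position get negative infinity, which corresponds to the black-box case. The site sequences are made-up toy data.

```python
import math
from collections import Counter


def walker_weights(sites, sequence):
    """Per-position weights (in bits) of `sequence` against aligned `sites`.

    Assumed formula: 2 + log2(frequency of the base in that column).
    A base absent from the column is given negative infinity.
    """
    weights = []
    for column, base in zip(zip(*sites), sequence):
        frequency = Counter(column)[base] / len(column)
        if frequency > 0:
            weights.append(2 + math.log2(frequency))
        else:
            weights.append(float("-inf"))
    return weights


toy_sites = ["ACGT", "ACGA", "ACGA", "AGTA"]  # hypothetical binding sites
print([round(w, 3) for w in walker_weights(toy_sites, "ACGA")])
print(walker_weights(toy_sites, "TCGA")[0])  # T never occurs in column 0
```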

Multiple sequence alignments are sometimes used to infer evolutionary relationships, from which a phylogenetic tree showing the evolutionary history of a number of organisms can be generated. Many bioinformatics tools allow for the generation and manipulation of such trees. An example Tree of Life (TOL) generated using Interactive Tree of Life (iTOL) [3], an online phylogenetic tree viewer, is shown in Fig. 6.
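One simple way to go from an alignment to a tree is distance-based clustering. The Python sketch below uses UPGMA on the fraction of differing columns (the p-distance) and writes the resulting topology in Newick format, a text format that tree viewers commonly accept. It is only one of many tree-building methods, it omits branch lengths, and the organism names and sequences are invented for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of alignment columns at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_newick(aligned):
    """Cluster aligned sequences with UPGMA; return the topology as Newick."""
    clusters = {name: 1 for name in aligned}  # cluster label -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = sorted(min(dist, key=dist.get))
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for other in clusters:
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            dist[frozenset((merged, other))] = (
                size_a * d_a + size_b * d_b
            ) / (size_a + size_b)
        # Drop distances that refer to the two clusters just merged.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"


toy = {  # hypothetical aligned sequences
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "mouse": "ACTTGCGA",
    "fly": "TCTAGCCA",
}
print(upgma_newick(toy))
```

With this toy data the two most similar sequences are joined first, and the most divergent one attaches last, giving a nested Newick string that could be pasted into a tree viewer.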

The following section will survey several tools, both commercial and free, which can be used for doing bioinformatics.

Bioinformatics tools

A huge number of bioinformatics tools exist, so any survey will necessarily be incomplete. In this section, I will introduce a number of representative tools.

Visualizing protein structures is an important tool for molecular and structural biologists. Visualization of the 3-D shape and structure of a protein can help the biologist identify catalytic and interactive sites and in other ways characterize a protein. Strap [4] (available for Windows, Linux, Mac and Unix) is an example of a

Conclusions

The tools surveyed integrate a variety of methods for visualizing sequence data and protein structures. While not yet widely available, new methods are emerging for the visualization of metabolic pathways as well. Pathfinder [10] is a tool for the dynamic visualization of metabolic pathways based on annotation data. Directed acyclic graphs represent the pathways and graph layout algorithms are used for dynamic drawing and visualization of pathways. MetNetVR [11] is an innovative approach that

Acknowledgment

I would like to thank my student Anand Doshi for his help in the preparation of this article.

References (13)

  • T.D. Schneider et al., Sequence logos: a new way to display consensus sequences, Nucleic Acids Research (1990)
  • T.D. Schneider, Sequence Walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences, Nucleic Acids Research (1997)
  • Interactive Tree of Life....
  • Strap....
  • Geneious....
  • CLC Combined Workbench....
There are more references available in the full text version of this article.
