Elsevier

Biosystems

Volume 90, Issue 1, July–August 2007, Pages 20-27

How to describe genes: Enlightenment from the quaternary number system

https://doi.org/10.1016/j.biosystems.2006.06.004

Abstract

As an open problem, computational gene identification has been widely studied, and many gene finders (software) have become available. However, little attention has been given to the problem of describing the common features of known genes in databanks, that is, of transforming raw data into human-understandable knowledge. In this paper, we draw attention to the task of describing genes and propose a trial implementation that treats DNA sequences as quaternary numbers. Under this treatment, the common features of genes can be represented by a “position weight function”, the core concept of a number system. In principle, the “position weight function” can be any real-valued function. Here, by approximating the function with trigonometric functions, we obtained characteristic parameters indicating single nucleotide periodicities for the genomes of the bacterium Escherichia coli K12 and the eukaryote yeast. As a byproduct of this approach, a single-nucleotide-level measure is derived that complements codon-based indexes in describing the coding quality and expression level of an open reading frame (ORF). The ideas presented here have the potential to become a general methodology for biological sequence analysis.

Section snippets

Background

In the past two decades, computational gene identification has been widely studied in response to the rapid growth of DNA sequence databases. Many methodologies have been employed to address this problem and they can be roughly categorized as follows: (i) Treating DNA sequence information as the result of a stochastic process, such as a Markov chain, Hidden Markov Model (HMM), or the corresponding expectation–maximization algorithm (Borodovsky and McIninch, 1993, Salzberg et al., 1998, Lukashin

The task of describing genes

Consider the Escherichia coli K12 genome as an example to explain the task. GenBank release 131.0 contains a total of 4289 annotated protein coding genes, of which the shortest is 45 base pairs and the longest is 7152 base pairs. In the database, DNA sequences are represented in a coarse-grained form: character sequences, viz., permutations of the four nucleotides A, C, G, T from the viewpoint of combinatorics. However, if all the permutation numbers of the four nucleotides are summed

Quaternary number system

The base of the quaternary number system is four, and its digits are 0–3. For a number N with L digits, its decimal representation is

$$N = \sum_{k=1}^{L} 4^{L-k} x_k,$$

where $x_k$ ($x_k \in \{0, 1, 2, 3\}$) is the digit at position k and $I(k) = 4^{L-k}$ is the position weight function. Note that the “position weight function” is a core concept of a number system: different positions in a numerical sequence carry different weights, and the weights can be described by a real-valued function.
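The definition above can be illustrated with a minimal Python sketch. The digit assignment A=0, C=1, G=2, T=3 is an assumption made here for illustration only (this excerpt does not state the mapping used in the paper), and the function names are hypothetical. The `weight` argument shows how the standard weight 4^(L−k) could be swapped for any other real-valued function of position.

```python
# Hypothetical digit assignment; the excerpt does not say which
# nucleotide corresponds to which quaternary digit.
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def position_weight(k, length):
    """Standard quaternary weight I(k) = 4**(L - k) for 1-based position k."""
    return 4 ** (length - k)


def weighted_value(seq, weight=position_weight):
    """Sum of weight(k, L) * x_k over all 1-based positions k of the sequence."""
    length = len(seq)
    return sum(
        weight(k, length) * DIGIT[base]
        for k, base in enumerate(seq.upper(), start=1)
    )


if __name__ == "__main__":
    # "ACGT" -> digits 0, 1, 2, 3 -> 0*64 + 1*16 + 2*4 + 3*1
    print(weighted_value("ACGT"))  # 27
    # Any other real-valued weight can be plugged in, e.g. a constant one.
    print(weighted_value("ACGT", weight=lambda k, length: 1.0))  # 6.0
```

With the standard weight, the result agrees with Python's built-in base-4 parsing, `int("0123", 4)`.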

For an instance of a

Materials

The complete DNA sequence of the genome of the bacterium E. coli K12 and the related annotation information were downloaded from GenBank Release 131.0 (accession no. U00096). The genome of the eukaryote Saccharomyces cerevisiae and the latest classification of its ORFs were downloaded from http://speedy.mips.biochem.mpg.de. Here we used the first class (including 3275 ORFs with known functions) to test the efficiency of a position weight function.

Common features of genes are revealed as single nucleotide periodicities

Since we use trigonometric functions to approximate the position weight function, the common features of genes are revealed as single nucleotide periodicities in the DNA sequences. The coefficient $a_m$ (see series (9)) corresponds directly to the m-periodicity. The absolute value of the coefficient indicates the magnitude of this periodicity in the coding sequences. All the coefficients $a_m$ (m = 1, 2, …, 12) of the position weight function for E. coli K12 and yeast are listed in Table 1.
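To give a feel for what an m-periodicity of a nucleotide sequence means, the sketch below projects the digit sequence onto a sinusoid of period m and reports the magnitude of that component. This is a generic Fourier-style measure and not the paper's series (9), which is not reproduced in this excerpt; the digit mapping and the function name are assumptions made for illustration.

```python
import math

# Hypothetical digit assignment, as in any quaternary encoding of A, C, G, T.
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def periodicity_strength(seq, m):
    """Magnitude of the period-m sinusoidal component of the digit sequence.

    The mean is removed first so that a constant offset does not contribute,
    and the result is normalised by the sequence length.
    """
    xs = [DIGIT[base] for base in seq.upper()]
    mean = sum(xs) / len(xs)
    re = sum((x - mean) * math.cos(2 * math.pi * k / m)
             for k, x in enumerate(xs, start=1))
    im = sum((x - mean) * math.sin(2 * math.pi * k / m)
             for k, x in enumerate(xs, start=1))
    return math.hypot(re, im) / len(xs)


if __name__ == "__main__":
    toy = "ATG" * 20  # artificial sequence with an exact period of 3
    for m in range(1, 13):
        print(m, round(periodicity_strength(toy, m), 3))
```

On the artificial period-3 sequence, the component at m = 3 dominates all others, which is the kind of signal a large coefficient for a given m would indicate.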

From its first

Conclusion

One purpose of this paper is to draw attention to the task of describing genes. By transforming DNA sequences into quaternary numbers, we presented a trial implementation for this task based on a generalized position weight function which is shared by protein coding genes. By trigonometric approximation of this function, the common features of genes are revealed as the single nucleotide periodicities. The results show that different species may have different strengths of single nucleotide

Acknowledgements

We thank Heng Wei, Hong-Yu Ou and Yun-Tao Dou for helpful discussions, and Prof. Hong-Yu Zhang for his help in polishing this paper. We are grateful to the anonymous reviewer for suggestions that made this paper clearer. This work was supported by the National Basic Research Program of China (2003CB114400) and the National Natural Science Foundation of China (30100035 and 30570383).
