Latent periodicity of serine–threonine and tyrosine protein kinases and other protein families

https://doi.org/10.1016/j.compbiolchem.2005.04.003

Abstract

We identified latent periodicity in the catalytic domains of approximately 85% of annotated serine–threonine and tyrosine protein kinases. Similar results were obtained for 22 other protein families and domains. We also designed the method of noise decomposition, which aims to distinguish between different periodicity types of the same period length. The method is to be used in conjunction with cyclic profile alignment, and this combination is able to reveal structure-related or function-related patterns of latent periodicity. Possible origins of the periodic structure of protein kinase active sites are discussed. In summary, we presume that latent periodicity is a common property of many catalytic protein domains.

Introduction

The development of mathematical techniques for the investigation of symbolic sequences is becoming increasingly important as large amounts of genetic information are gathered (Benson et al., 2000, Stoesser et al., 2001, Adams et al., 2000, Venter et al., 2001). These huge sets of sequences certainly contain hidden information, but our present search tools have limited sensitivity. Novel ways to extract information from symbolic sequences would radically improve our ability to derive biologically significant knowledge from genetic texts, our understanding of gene evolution and of the evolutionary rearrangement of genomes, and our ability to create dynamic models of the cell's genetic regulation and to design artificial proteins with predefined features.

One way to investigate the features of a symbolic sequence is to study its periodicity. This has a reasonable biological rationale, because multiple duplications of DNA sequence fragments, followed by substitutions, insertions and deletions of symbols, could serve as the basis for the evolution of genes and genomes. This proposition is supported by the fact that certain structure in genetic texts was previously revealed using various mathematical techniques applied to amino acid and nucleotide sequences (Shulman et al., 1981, Michel, 1986, Konopka et al., 1987, Tautz et al., 1986, Bell, 1996, Konopka and Martindale, 1995, Martindale and Konopka, 1996, Trifonov and Sussman, 1980, Zhurkin, 1983, Gabrielian and Bolshoy, 1999, Konopka and Chatterjee, 1988). The discovery of periodicity in the active centers of enzymes would suggest that, in the past, genes could have been built up by simple repetition of certain relatively short DNA fragments. We may also suppose that such a structure of protein active sites indicates a possible role for latent amino acid periodicity in the selection and stabilization of the proper conformation of the protein globule.

Dynamic programming (Heringa, 1994, Heringa, 1998, Heringa and Argos, 1993, Benson, 1997, Benson, 1999, Heger and Holm, 2000, Andrade et al., 2000) and Fourier transformation (Taylor et al., 2002, Lobzin and Chechetkin, 2000, Dodin et al., 2000, Jackson et al., 2000, Rackovsky, 1998, Chechetkin and Lobzin, 1998, Coward and Drablos, 1998, Voss, 1992, Silverman and Linsker, 1996, McLachlan, 1993) are commonly used to identify periodicity. We have developed our own mathematical approach to the search for periodicity, based on the Information Decomposition (ID) of symbolic sequences (Korotkov and Korotkova, 1995, Korotkov et al., 1997, Korotkov et al., 2003, Chaley et al., 1999, Korotkova et al., 1999). The main idea of this approach is that the information content of any symbolic sequence can be decomposed into mutually nonoverlapping constituents. Each constituent represents the mutual information between the investigated symbolic sequence and an artificial periodic sequence with some period length. The dependence of mutual information on period length may be presented as a spectral graph that resembles a Fourier power spectrum but has substantially different properties (Korotkov et al., 2003). This decomposition avoids the shortcomings peculiar to dynamic programming and Fourier transformation, and it allows us to detect so-called latent periodicity, that is, periodicity that other techniques are unable to detect.
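The core quantity described above, the mutual information between a sequence and an artificial periodic sequence, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `period_mutual_information` and `mi_spectrum` are hypothetical, the value is the plain mutual information in nats rather than any test statistic used in the paper, and the step that makes the constituents mutually nonoverlapping is omitted, so multiples of a true period also score highly here.

```python
import math
from collections import Counter


def period_mutual_information(seq, period):
    """Mutual information (in nats) between the symbols of `seq` and the
    labels 0..period-1 of an artificial periodic sequence of equal length.
    Illustrative only: no significance test, no overlap correction."""
    n = len(seq)
    if n == 0 or period < 1:
        raise ValueError("need a non-empty sequence and a period >= 1")
    # Joint counts of (position within period, symbol) and both marginals.
    joint = Counter((i % period, s) for i, s in enumerate(seq))
    pos = Counter(i % period for i in range(n))
    sym = Counter(seq)
    mi = 0.0
    for (p, s), c in joint.items():
        mi += (c / n) * math.log(c * n / (pos[p] * sym[s]))
    return mi


def mi_spectrum(seq, max_period):
    """Mutual information for every period length from 2 to max_period."""
    return {p: period_mutual_information(seq, p)
            for p in range(2, max_period + 1)}
```

For a perfectly periodic toy sequence such as `"ABC" * 20`, the value at period 3 equals the symbol entropy and the value at period 2 is zero, which is the kind of peak a spectral graph would show.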

However, like Fourier-based techniques, the method of information decomposition (in its current form) cannot find statistically significant latent periodicity in the presence of many insertions and deletions. This suggests that a substantial part of the latent periodicity in genetic texts remains unseen by information decomposition-based techniques as well as by other known methods. The simplest way to search for latent periodicity with insertions and deletions is to combine information decomposition with modified profile analysis. In this combination, information decomposition detects latent periodicity in some amino acid sequences and produces the periodicity matrix (Korotkov et al., 2003), which can be used to determine the weight of each amino acid at each position in the period. Modified profile analysis then allows us to identify periodicity of the corresponding type (defined by the cyclic position-weight matrix just constructed), with possible insertions and deletions, throughout a primary sequence data bank such as Swiss-Prot. The search results can in turn be used to modify the profile matrix for increased sensitivity and specificity.
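As a rough illustration of how a periodicity matrix might be turned into position weights, the Python sketch below counts symbols at each position of the period, converts the counts into log-odds weights, and scores a query sequence against the resulting cyclic matrix. Everything in it is an assumption made for illustration: the function names, the pseudocount, and the log-odds weighting are not taken from the paper, and the scoring is ungapped, so the insertion and deletion handling of the modified profile analysis is not reproduced.

```python
import math
from collections import Counter


def periodicity_matrix(seq, period):
    """Count how often each symbol occurs at each position of the period."""
    counts = [Counter() for _ in range(period)]
    for i, s in enumerate(seq):
        counts[i % period][s] += 1
    return counts


def cyclic_weights(counts, alphabet, pseudocount=1.0):
    """Log-odds weight of each symbol at each period position, relative to
    the overall symbol frequency (assumed weighting, not the paper's)."""
    total = Counter()
    for col in counts:
        total.update(col)
    grand = sum(total.values()) + pseudocount * len(alphabet)
    background = {a: (total[a] + pseudocount) / grand for a in alphabet}
    weights = []
    for col in counts:
        col_sum = sum(col.values()) + pseudocount * len(alphabet)
        weights.append({
            a: math.log((col[a] + pseudocount) / col_sum / background[a])
            for a in alphabet
        })
    return weights


def best_phase_score(seq, weights):
    """Ungapped score of `seq` against the cyclic matrix, maximized over
    the starting phase. Symbols outside the alphabet raise KeyError."""
    period = len(weights)
    scores = [
        sum(weights[(i + phase) % period][s] for i, s in enumerate(seq))
        for phase in range(period)
    ]
    best = max(range(period), key=scores.__getitem__)
    return best, scores[best]
```

A matrix built from `"ACD" * 10` with period 3 should score the query `"CDACDA"` best at phase 1, because the query starts at the second position of the period. A search that tolerates indels would replace `best_phase_score` with a dynamic-programming alignment against the cyclic matrix.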

The first goal of this publication is to present the method of noise decomposition. For many sequence families, perfect tandem periodicity is disrupted by indels; cyclic alignment therefore extends our ability to identify latent periodicity to cases where sufficiently short indels are present. The method also allows us to distinguish between different periodicity types of the same period length, because closely related but distinct types of periodicity sometimes occur. In this paper, we demonstrate that our decomposition technique is able to distinguish two latent repeat profiles as close as those present in serine–threonine and tyrosine kinases.

The second goal is to reveal the prevalence of latent periodicity in protein kinases. The latent periodicity we previously identified in the catalytic domains of seven protein kinases turned out to be a more common property of these proteins than one would expect. To show this, we applied modified iterative circular profile analysis and the method of noise decomposition; our efforts resulted in a certain modification of the initial position-weight matrix and in the identification of latent periodicity in the active sites of 1215 protein kinases present in the Swiss-Prot data bank. The data we gathered indicate that latent periodicity is a property of at least a great majority of eukaryotic protein kinases.

The third goal of this publication is to show that latent periodicity is present in a number of different protein families. For this purpose, we applied the ID and noise decomposition methods to selected protein families from the Swiss-Prot data bank. We found that the amino acid sequences of many protein domains have latent periodicity with various period lengths. We discuss these results and propose that latent periodicity could reflect the origin of proteins from multiple ancient tandem duplications.

Section snippets

Methods and algorithms

Let us first define which kinds of periodicity we call latent. Generally, latent periodicity is periodicity that is hard to identify at a proper level of statistical significance using internal homology search techniques. The homology between periods (repeats) is often determined using amino acid similarity matrices, such as PAM or BLOSUM (Benson, 1997, Benson, 1999, Heger and Holm, 2000, Andrade et al., 2000), where the weights for similar amino acids are higher than those for

Results

First of all, we would like to demonstrate the consistency, reliability and usefulness of our techniques in searching for known types of tandem repeats. Ankyrin and leucine-rich repeats were chosen for this purpose. The initial profiles of these repeats were obtained using ID with period lengths of 33 and 24 residues, respectively.

We identified 146 of 150 sequences containing at least three marked ankyrin repeats (three times the period length was chosen to be minimal length of

Discussion

The notion of latent periodicity and the technique of searching for it were initially presented in (Korotkov and Korotkova, 1995) and refined in subsequent investigations (Korotkov et al., 1997, Korotkov et al., 2003, Chaley et al., 1999, Korotkova et al., 1999). In the studies we have performed, we revealed the existence of various types of latent periodicity in numerous amino acid sequences (Korotkova et al., 1999, Korotkov et al., 2003). These identified latently-periodic

Acknowledgements

We would like to thank Dr. Michael Ochs from Fox Chase Cancer Center (Philadelphia, PA) for helpful comments and suggestions on the manuscript.

References (80)

  • J. Heringa

    Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

    Curr. Opin. Struct. Biol.

    (1998)
  • T. Hunter

    Protein kinase classification

    Meth. Enzymol.

    (1991)
  • J.H. Jackson et al.

    Vectors of Shannon information from Fourier signals characterizing base periodicity in genes and genomes

    Biochem. Biophys. Res. Commun.

    (2000)
  • H. Kentrup et al.

    Dyrk, a dual specificity protein kinase with unique structural features whose activity is dependent on tyrosine residues between subdomains VII and VIII

    J. Biol. Chem.

    (1996)
  • G. Knarr et al.

    BiP-binding Sequences in HIV gp160

    J. Biol. Chem.

    (1999)
  • A.K. Konopka

    Sequences and codes: fundamentals of biomolecular cryptology

  • A.K. Konopka et al.

    Distance Analysis and Sequence Properties of Functional Domains in Nucleic Acids and Proteins

    Gene Anal. Technol.

    (1988)
  • A.K. Konopka et al.

    Complexity charts can be used to map functional domains in DNA

    Gene Anal. Technol. Appl.

    (1990)
  • E.V. Korotkov et al.

    Information decomposition of symbolic sequences

    Phys. Lett. A

    (2003)
  • C. Martindale et al.

    Oligonucleotide frequencies in DNA follow a Yule distribution

    Comput. Chem.

    (1996)
  • C.J. Michel

    New statistical approach to discriminate between protein coding and non-coding regions in DNA sequences and its evaluation

    J. Theor. Biol.

    (1986)
  • W.E. Muller et al.

    Gene structure and function of tyrosine kinases in the marine sponge Geodia cydonium: autapomorphic characters in Metazoa

    Gene

    (1999)
  • S.B. Needleman et al.

    A general method applicable to the search for similarities in the amino acid sequence of two proteins

    J. Mol. Biol.

    (1970)
  • R.W. Ruddon et al.

    Assisted protein folding

    J. Biol. Chem.

    (1997)
  • P. Salamon et al.

    A maximum entropy principle for distribution of local complexity in naturally occurring nucleotide sequences

    Comput. Chem.

    (1992)
  • P. Salamon et al.

    On the robustness of maximum entropy relationships for complexity distributions of nucleotide sequences

    Comput. Chem.

    (1993)
  • M.J. Shulman et al.

    The coding function of nucleotide sequences can be discerned by statistical analysis

    J. Theor. Biol.

    (1981)
  • T.F. Smith et al.

    Identification of common molecular subsequences

    J. Mol. Biol.

    (1981)
  • I.M. Takenaka et al.

    Hsc70-binding peptides selected from a phage display peptide library that resemble organellar targeting sequences

    J. Biol. Chem.

    (1995)
  • S.S. Taylor et al.

    Three protein kinase structures define a common motif

    Structure

    (1994)
  • W.J. Wilbur et al.

    A theory of information with special application to search problems

    Comput. Chem.

    (2000)
  • M.D. Adams

    The genome sequence of Drosophila melanogaster

    Science

    (2000)
  • A. Bairoch et al.

    The Swiss-Prot protein sequence database and its supplement TrEMBL

    Nucleic Acids Res.

    (2000)
  • D.A. Benson et al.

    GenBank

    Nucleic Acids Res.

    (2000)
  • G. Benson

    Sequence alignment with tandem duplication

    J. Comput. Biol.

    (1997)
  • G. Benson

    Tandem repeats finder: a program to analyze DNA sequences

    Nucleic Acids Res.

    (1999)
  • S. Breen et al.

    Renewal theory for several patterns

    J. Appl. Prob.

    (1985)
  • M.B. Chaley et al.

    Latent periodicity of 21 bases typical for mcp ii gene is widely present in various bacterial genes

    DNA Seq.

    (2003)
  • M.B. Chaley et al.

    Method revealing latent periodicity of the nucleotide sequences for a case of small samples

    DNA Res.

    (1999)
  • V.R. Chechetkin et al.

    Nucleosome units and hidden periodicities in DNA sequences

    J. Biomol. Struct. Dyn.

    (1998)