Application of 2D graphic representation of protein sequence based on Huffman tree method

https://doi.org/10.1016/j.compbiomed.2012.01.011

Abstract

Based on the Huffman tree method, we propose a new 2D graphic representation of protein sequences. This representation completely avoids loss of information in the transfer of data from a protein sequence to its graphic representation. The method consists of two parts. The first is the construction of 0–1 codes for the 20 amino acids by a Huffman tree built from amino acid frequencies, where the frequency of an amino acid is defined as the number of times it occurs in the analyzed protein sequences. The second is the 2D graphic representation of a protein sequence based on these 0–1 codes. Applications of the method to ten ND5 genes and seven Escherichia coli strains are then presented in detail. The results show that the proposed model may provide new insights into the evolution patterns determined from protein sequences and complete genomes.

Introduction

The rapid growth of biological sequence data, such as DNA and protein sequences, has created many challenges for bioscientists. In response to this explosive growth, experimental, mathematical and graphic approaches have been employed to study the structure, function, evolution and attribution [1] of these sequences.

Graphic techniques have emerged as a powerful tool for the analysis and visualization of long biological sequences. Their advantage is that they provide a simple way of viewing, sorting, and comparing various gene structures, which helps in recognizing major differences among similar DNA and protein sequences. A graphical method for visualizing DNA sequences was first proposed by Hamori in 1983 [2]. Afterwards, Hamori [3] and Jeffrey [4] considered two other graphical representation methods for DNA sequences. The original plot of a DNA sequence as a random walk on a 2D grid, using the four cardinal directions to represent the four bases A (adenine), G (guanine), T (thymine) and C (cytosine), was done by Gates [5], Nandy [6] and Leong and Morgenthaler [7]. Over the past ten years, authors such as Bielinska-Waz [8], [9], Randić [10], [11], [12], [13], Jaklic [14], Novic [15] and Qi [16], [17], [18] have also presented their own graphical representations. These graphical methods provide useful insights into local and global characteristics along a sequence that are not easily observed from the DNA sequence itself. Two recent references, Randić et al. [19] and Ghosh and Nandy [20], give a more detailed introduction to graphical methods for visualizing DNA sequences, and readers can find there fuller accounts of the various graphical representations of DNA.

Graphical representations of proteins appeared later than those of DNA: the first was published in 2004 [21]. It assumes a unique correspondence between one selected collection of 20 nucleotide triplets and the 20 amino acids that they represent. The Virtual Genetic Code converts a protein sequence into a hypothetical DNA sequence, and allows one to use available graphical representations of DNA to generate a graphical representation for proteins [19]. Some novel graphical approaches were then developed that allow a direct representation of proteins [22], [23]. In addition, to reflect the differences among the 20 natural amino acids, some graphic representations of proteins take physicochemical properties into account. For example, Chou et al. [24] proposed a 2D representation method, the ‘wenxiang diagram’, to characterize the disposition of hydrophobic and hydrophilic residues. Wen and Zhang [25] proposed a 2D graphic representation based on the pKa values of different amino acids. Wu et al. [26] built a web server for creating graphic representations of protein sequences from two different physicochemical properties of their constituent amino acids.

In the present study, we propose a new 2D graphic representation of protein sequences based on the 0–1 codes of the 20 amino acids obtained from a Huffman tree. These codes provide a compressed way to represent protein sequences in binary units. Further, the 0–1 codes still reflect the overall frequency characteristics of the protein sequences considered. The proposed method can graphically represent a sequence with no degeneracy and no loss of information. The rest of the paper is organized as follows. Section 2 presents the 2D graphic representation of protein sequences based on the Huffman tree method. Section 3 mainly discusses the application to genome comparison of seven Escherichia coli strains. Section 4 gives the conclusion of this paper.

Section snippets

The 0–1 codes of 20 amino acids by Huffman tree

Huffman codes are digital data compression codes that resulted from the work of Prof. David A. Huffman (1925–1999) [27]. Huffman codes exploit the entropy of the message to give good compression ratios. Huffman encoding is a variable-length encoding scheme, that is, the length of the code for a symbol depends on the frequency of occurrence of that symbol. Huffman coding is further classified into two groups: static Huffman coding and adaptive Huffman coding. In
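As a rough illustration of the static variant, the sketch below builds 0–1 codes from amino acid counts with a standard Huffman construction. It is not the paper's code table: the function name `huffman_codes` and the two toy sequences are hypothetical, and the assignment of 0 and 1 to branches (and the tie-breaking between equal weights) is an arbitrary choice made here.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(frequencies):
    """Return a {symbol: bitstring} prefix code built from symbol counts."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        # A single symbol still needs one bit to be representable.
        return {next(iter(frequencies)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    # Each heap item is (weight, tiebreak, {symbol: partial code}).
    heap = [(w, next(tiebreak), {s: ""}) for s, w in sorted(frequencies.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)   # lightest subtree
        w1, _, right = heapq.heappop(heap)  # second lightest subtree
        # Prepend 0 to every code on the left branch, 1 on the right branch.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical toy sequences; real input would be the analyzed proteins.
sequences = ["MKVLAAGIVG", "MKLLAAGAVG"]
freq = Counter("".join(sequences))
codes = huffman_codes(freq)
encoded = "".join(codes[aa] for aa in sequences[0])
```

In this toy input the most frequent residue (A) receives a code no longer than that of the rarest one (I), which is the frequency dependence described above.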

Geometrical center of genome curve

According to the proposed 2D graphic representation model, protein sequences are represented by a set of geometrical points in 2D space. One obtains a long graphic curve by sequentially connecting all points of every gene along the genome. However, differences between curves are not easy to discern when the graphic representation reaches the level of a whole genome. In order to find some of the invariants sensitive to the form of the graphic curve, we use the geometrical center of the genome curve to represent
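The snippet breaks off before the geometrical center is defined, so the following is only a minimal sketch under one common assumption: the center is taken as the arithmetic mean of the curve's point coordinates. The function names and the two small point lists are hypothetical and do not come from the paper.

```python
import math


def geometric_center(points):
    """Arithmetic mean of the x and y coordinates of a list of 2D points."""
    if not points:
        raise ValueError("need at least one point")
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return (cx, cy)


def center_distance(curve_a, curve_b):
    """Euclidean distance between the centers of two curves."""
    ax, ay = geometric_center(curve_a)
    bx, by = geometric_center(curve_b)
    return math.hypot(ax - bx, ay - by)


# Hypothetical curves standing in for two genome curves.
curve_1 = [(0, 0), (2, 0), (2, 2), (0, 2)]
curve_2 = [(3, 4), (5, 4), (5, 6), (3, 6)]
d = center_distance(curve_1, curve_2)
```

Reducing each curve to a single point makes pairwise comparison cheap, at the cost of discarding the shape of the curve, so a distance of this kind is best read as a coarse summary.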

Conclusion

High complexity and degeneracy are major problems in previous protein sequence representations. The proposed method provides a direct plotting approach that denotes protein sequences, and even genomes, without degeneracy. From the protein curves, the 20 amino acids as well as the original protein sequence can be recovered mathematically without loss of information. The method supports intuitive inspection of data, helping to analyze similarities among different protein sequences or genomes. And the
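The recoverability claim rests in part on the prefix-free property of Huffman codes: no code is the beginning of another, so a concatenated bit string can be split in only one way. The sketch below shows this for the bit string alone and says nothing about the 2D plotting step, which is not given in this snippet. The four-residue code table is hypothetical and chosen only for illustration.

```python
# Hypothetical prefix-free table for four residues (not the paper's codes).
CODES = {"A": "0", "G": "10", "L": "110", "K": "111"}


def encode(sequence, codes):
    """Concatenate the 0-1 code of each residue."""
    return "".join(codes[residue] for residue in sequence)


def decode(bits, codes):
    """Recover the residue sequence by scanning the bits left to right."""
    inverse = {code: residue for residue, code in codes.items()}
    residues = []
    buffer = ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # a complete code word has been read
            residues.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("bit string ends inside a code word")
    return "".join(residues)


bits = encode("GALKA", CODES)
restored = decode(bits, CODES)
```

Because the table is prefix-free, the decoder never has to look ahead or backtrack, and the round trip returns the original sequence.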

Conflict of interest statement

None declared.

Acknowledgment

We thank the anonymous reviewers for their valuable comments, which helped to improve this paper. This work is supported by the Humanities and Social Sciences Research of the Ministry of Education of China (Project name: The Origin, Propagation and Migration of Human Influenza Epidemic (1918–2010) from Space-time Perspective; Project no. 11YJCZH132).

Zhao-Hui Qi is with the College of Information Science and Technology at Shijiazhuang Tiedao University in Shijiazhuang, Hebei. Before joining this university, he received the Ph.D. degree from the College of Computer at Tianjin University in Tianjin in 2006. His research interests include biological information processing and medical software development.

References (35)

  • M. Randić et al., Graphical representation of proteins as four-color maps and their numerical characterization, J. Mol. Graphics Modelling (2009)
  • J. Wen et al., 2D graphical representation of protein sequence and its numerical characterization, Chem. Phys. Lett. (2009)
  • Z.C. Wu et al., 2D-MH: a web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids, J. Theor. Biol. (2010)
  • M. Randić et al., Novel 2-D graphical representation of proteins, Chem. Phys. Lett. (2006)
  • E. Hamori, Graphic representation of long DNA sequences by the method of H curves-current results and future aspects, BioTechniques (1989)
  • H.J. Jeffrey, Chaos game representation of gene structure, Nucleic Acids Res. (1990)
  • A. Nandy, A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes, Curr. Sci. (1994)
Jun Feng is with the College of Information Science and Technology at Shijiazhuang Tiedao University in Shijiazhuang, Hebei. She received the Ph.D. degree from the University of Science and Technology Beijing in Beijing in 2005. Her research interests include biological information processing.

Xiao-Qin Qi received the B.S. degree in Computer Education from Hunan Normal University in 2002. She is currently working towards a Master's degree in Computer Application at Shijiazhuang Tiedao University, Hebei.

Ling Li is with the Basic Courses Department at Zhejiang Shuren University in Hangzhou, Zhejiang. Her research interests include bioinformatics and medical software development.
