Application of 2D graphic representation of protein sequence based on Huffman tree method

doi:10.1016/j.compbiomed.2012.01.011

Computers in Biology and Medicine

Volume 42, Issue 5, May 2012, Pages 556-563

Abstract

Based on the Huffman tree method, we propose a new 2D graphic representation of protein sequences. This representation completely avoids loss of information in the transfer of data from a protein sequence to its graphic representation. The method consists of two parts. The first assigns 0–1 codes to the 20 amino acids using a Huffman tree built from amino acid frequencies, where the frequency of an amino acid is defined as the number of times it occurs in the analyzed protein sequences. The second constructs the 2D graphic representation of a protein sequence from these 0–1 codes. Applications of the method to ten ND5 genes and seven Escherichia coli strains are then presented in detail. The results show that the proposed model may provide some new insights into the evolution patterns determined from protein sequences and complete genomes.

Introduction

The rapid growth of biological sequence data, such as DNA and protein sequences, has created many challenges for bioscientists. In response to this explosive growth, experimental, mathematical and graphic approaches have been employed to study the structure, function, evolution and attributes [1] of these sequences.

Graphic techniques have emerged as a powerful tool for the analysis and visualization of long biological sequences. The advantage of graphic representations of biological sequences is that they provide a simple way of viewing, sorting, and comparing various gene structures, which helps in recognizing major differences among similar DNA and protein sequences. A graphical method for visualizing DNA sequences was first proposed by Hamori in 1983 [2]. Afterwards, Hamori [3] and Jeffrey [4] considered two other graphical representation methods for DNA sequences. The original plot of a DNA sequence as a random walk on a 2D grid, using the four cardinal directions to represent the four bases A (adenine), G (guanine), T (thymine) and C (cytosine), was introduced by Gates [5], Nandy [6] and Leong and Morgenthaler [7]. In the past ten years, authors such as Bielinska-Waz [8], [9], Randić [10], [11], [12], [13], Jaklic [14], Novic [15] and Qi [16], [17], [18] have also presented their own graphical representations. These graphical methods provide useful insights into local and global characteristics along a sequence that are not easily observed from the DNA sequence itself. Two recent references, by Randić et al. [19] and Ghosh and Nandy [20], give more detailed accounts of the various graphical representations of DNA.

Graphical representations of proteins appeared later than those of DNA: the first was published in 2004 [21]. It assumes a unique correspondence between a selected collection of 20 nucleotide triplets and the 20 amino acids that they represent. This Virtual Genetic Code converts a protein sequence into a hypothetical DNA sequence and allows one to use available graphical representations of DNA to generate a graphical representation for proteins [19]. Novel graphical approaches that allow a direct representation of proteins were developed subsequently [22], [23]. In addition, to reflect the differences among the 20 natural amino acids, some graphic representations of proteins take physicochemical properties into account. For example, Chou et al. [24] proposed a 2D representation method, the ‘wenxiang diagram’, to characterize the disposition of hydrophobic and hydrophilic residues. Wen and Zhang [25] proposed a 2D graphic representation based on the pKa values of the amino acids. Wu et al. [26] built a web server for creating graphic representations of protein sequences from two different physicochemical properties of their constituent amino acids.

In the present study, we propose a new 2D graphic representation of protein sequences based on 0–1 codes of the 20 amino acids obtained from a Huffman tree. These codes provide a compressed, binary representation of protein sequences while still reflecting the overall frequency characteristics of the protein sequences under consideration. The proposed method can graphically represent a sequence with no degeneracy and no loss of information. The rest of the paper is organized as follows. Section 2 presents the 2D graphic representation of protein sequences based on the Huffman tree method. Section 3 mainly discusses the application to genome comparison of seven Escherichia coli strains. Section 4 gives the conclusion of this paper.

Section snippets

The 0–1 codes of 20 amino acids by Huffman tree

Huffman codes are digital data compression codes that originate from the work of Prof. David A. Huffman (1925–1999) [27]. Huffman codes exploit the entropy of the message to give good compression ratios. The Huffman encoding scheme is a variable-length encoding method; that is, the length of the code for a symbol depends on the frequency of occurrence of that symbol, with more frequent symbols receiving shorter codes. Huffman coding is further classified into two groups: static Huffman coding and adaptive Huffman coding. In
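As a minimal sketch of static Huffman coding applied to amino acids (not the authors' implementation), the Python below counts each amino acid in a set of sequences and builds a 0–1 code table in which more frequent residues get shorter codes. The function names `huffman_codes` and `encode`, the toy sequences, and the tie-breaking rule are assumptions made for illustration; the paper's actual frequency table and code assignments are not reproduced here.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(sequences):
    """Build a 0-1 code table from amino acid counts in `sequences`.

    Hypothetical helper: the frequency of an amino acid is taken to be
    its number of occurrences in the analyzed sequences.
    """
    freq = Counter("".join(sequences))
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit to be representable.
        return {next(iter(freq)): "0"}

    # The counter breaks ties so the heap never compares the dict payloads.
    tiebreak = count()
    heap = [(n, next(tiebreak), {aa: ""}) for aa, n in sorted(freq.items())]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two least frequent subtrees; prefix their codes with 0 / 1.
        n0, _, left = heapq.heappop(heap)
        n1, _, right = heapq.heappop(heap)
        merged = {aa: "0" + code for aa, code in left.items()}
        merged.update({aa: "1" + code for aa, code in right.items()})
        heapq.heappush(heap, (n0 + n1, next(tiebreak), merged))

    return heap[0][2]


def encode(sequence, codes):
    """Concatenate the 0-1 codes of the residues in `sequence`."""
    return "".join(codes[aa] for aa in sequence)


# Toy data for illustration only, not sequences from the paper.
toy_sequences = ["MKKLLLLA", "MKLLA"]
codes = huffman_codes(toy_sequences)
bits = encode("MKLLA", codes)
```

With this toy data, L is the most frequent residue and receives a one-bit code. A different tie-breaking rule could produce different code words, but the total encoded length would be the same.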

Geometrical center of genome curve

According to the proposed 2D graphic representation model, protein sequences are represented by a set of geometrical points in 2D space. One obtains a long graphic curve by sequentially connecting the points of every gene along the genome. However, differences between curves are not easy to identify when the graphic representation reaches the level of a whole genome. In order to find invariants that are sensitive to the form of the graphic curve, we use the geometrical center of the genome curve to represent
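The idea of summarizing a curve by its geometrical center can be sketched as follows. This assumes the center is the arithmetic mean of the curve's point coordinates and takes the 2D points as given, since the mapping from 0–1 codes to points is not shown in this excerpt. The names `geometric_center` and `center_distance` are hypothetical, and comparing two curves by the Euclidean distance between their centers is an illustrative assumption rather than the paper's stated measure.

```python
import math


def geometric_center(points):
    """Return the mean (x, y) of a sequence of 2D points."""
    points = list(points)
    if not points:
        raise ValueError("need at least one point")
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return (cx, cy)


def center_distance(a, b):
    """Euclidean distance between two centers (assumed comparison measure)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Toy curves for illustration only; real curves would come from the
# 2D graphic representation of each genome.
curve_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
curve_b = [(3, 4), (5, 4), (5, 6), (3, 6)]
d = center_distance(geometric_center(curve_a), geometric_center(curve_b))
```

Reducing each genome curve to a single point makes pairwise comparison cheap, at the cost of discarding the shape of the curve.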

Conclusion

High complexity and degeneracy are major problems in previous protein sequence representations. The proposed method provides a direct plotting approach that denotes protein sequences, and even genomes, without degeneracy. From the protein curves, the 20 amino acids as well as the original protein sequence can be recovered mathematically without loss of information. The method allows intuitive inspection of data, which helps in analyzing similarities among different protein sequences or genomes. And the
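One reason a Huffman-based encoding can be lossless is that Huffman codes are prefix-free: no code word is the beginning of another, so a bit string splits into code words in exactly one way. The sketch below illustrates only this bit-string step, not the recovery of bits from the 2D curve, which depends on a mapping not shown in this excerpt. The code table `toy_codes` and the function names are hypothetical.

```python
def encode(sequence, codes):
    """Concatenate the 0-1 codes of the residues in `sequence`."""
    return "".join(codes[aa] for aa in sequence)


def decode(bits, codes):
    """Recover the residue sequence from a prefix-free 0-1 encoding."""
    inverse = {code: aa for aa, code in codes.items()}
    residues = []
    buffer = ""
    for bit in bits:
        buffer += bit
        # Prefix-freeness guarantees the first match is the only match.
        if buffer in inverse:
            residues.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(residues)


# Hypothetical prefix-free table for four residues, for illustration only.
toy_codes = {"L": "0", "K": "10", "A": "110", "M": "111"}
bits = encode("MKLLA", toy_codes)
restored = decode(bits, toy_codes)
```

The round trip returns the original sequence, and a bit string that ends partway through a code word is rejected rather than silently truncated.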

Conflict of interest statement

None declared.

Acknowledgment

We thank the anonymous reviewers for their valuable comments to improve this paper. This work is supported by Humanities and Social Sciences Research of Ministry of Education of China (Project name, The Origin, Propagation and Migration of Human Influenza Epidemic (1918–2010) from Space-time Perspective; Project no. 11YJCZH132).

Zhao-Hui Qi is with the College of Information Science and Technology at Shijiazhuang Tiedao University in Shijiazhuang, Hebei. Before joining this university, he received the Ph.D. degree from the College of Computer at Tianjin University, Tianjin, in 2006. His research interests include biological information processing and medical software development.

References (35)

K.C. Chou
Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review)
J. Theor. Biol.
(2011)
E. Hamori et al.
H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences
J. Biol. Chem.
(1983)
M.A. Gates
A simple way to look at DNA
J. Theor. Biol.
(1986)
D. Bielinska-Waz et al.
Classification studies based on a spectral representation of DNA
J. Theor. Biol.
(2010)
M. Randić et al.
Novel 2-D graphic representation of DNA sequences and their numerical characterization
Chem. Phys. Lett.
(2003)
M. Randić
Another look at the chaos-game representation of DNA
Chem. Phys. Lett.
(2008)
Z.H. Qi et al.
3D graphic representation of DNA sequences and their numerical characterization
Chem. Phys. Lett.
(2007)
X.Q. Qi et al.
New 3D graphic representation of DNA sequence based on dual nucleotides
J. Theor. Biol.
(2007)
Z.H. Qi et al.
Novel 2D graphic representation of DNA sequence based on dual nucleotides
Chem. Phys. Lett.
(2007)
A. Ghosh et al.
Graphical representation and mathematical characterization of protein sequences and applications to viral proteins
Adv. Protein Chem. Struct. Biol.
(2011)
M. Randić et al.
Graphical representation of proteins as four-color maps and their numerical characterization
J. Mol. Graphics Modelling
(2009)
J. Wen et al.
2D graphical representation of protein sequence and its numerical characterization
Chem. Phys. Lett.
(2009)
Z.C. Wu et al.
2D-MH: a web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids
J. Theor. Biol.
(2010)
M. Randić et al.
Novel 2-D graphical representation of proteins
Chem. Phys. Lett.
(2006)
E. Hamori
Graphic representation of long DNA sequences by the method of H curves-current results and future aspects
BioTechniques
(1989)
H.J. Jeffrey
Chaos game representation of gene structure
Nucleic Acids Res.
(1990)
A. Nandy
A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes
Curr. Sci.
(1994)

Jun Feng is with the College of Information Science and Technology at Shijiazhuang Tiedao University in Shijiazhuang, Hebei, and received the Ph.D. degree from the University of Science and Technology Beijing, Beijing, in 2005. Research interests include biological information processing.

Xiao-Qin Qi received the B.S. degree in Computer Education from Hunan Normal University in 2002. She is currently working towards a Master's degree in Computer Application at Shijiazhuang Tiedao University, Hebei.

Ling Li is with the Basic Courses Department at Zhejiang Shuren University in Hangzhou, Zhejiang. Her research interests include bioinformatics and medical software development.