A 9-state hidden Markov model using protein secondary structure information for protein fold recognition

https://doi.org/10.1016/j.compbiomed.2009.03.008

Abstract

In protein fold recognition, the main disadvantage of hidden Markov models (HMMs) is their reliance on large-scale model architectures, which require large data sets and high computational resources for training. In addition, HMMs should take into account sequential information about the secondary structures of proteins, both to improve prediction performance and to reduce the number of model parameters. We therefore propose a novel method for protein fold recognition based on a hidden Markov model, called a 9-state HMM. The method can (i) reduce the number of states by using secondary structure information about the proteins of each fold and (ii) recognize protein folds more accurately than other HMMs.

Introduction

The genome projects have provided sequence information for a large number of proteins. An accurate 3D structural representation of proteins is necessary for determining the detailed functions of these sequences [1]. Protein structures can be determined experimentally, by methods such as X-ray crystallography and NMR [2]. However, the experimental determination of protein structures remains a long and difficult task. To address this problem, many researchers are developing structure prediction methods [1], [3], [4]. Protein structure prediction methods predict 3D structures and various properties of proteins, such as secondary structures and domain boundaries, from their primary structures. Protein 3D structure prediction first detects homology between protein sequences [3]. When sequence similarity is high, finding homologies is very easy. Depending on the similarity between proteins, tertiary structure prediction is divided into three cases as follows [4]: first, when the similarity of two sequences is very high, fragment-based comparative modeling easily finds homologies between protein sequences. Second, the fold recognition method is used to find remote, or distant, homologies. Third, when no template can be found among proteins of known 3D structure, the New Fold method, also called ab initio or de novo, is used.

A newly identified protein can be related to proteins in annotated databases for which the 3D structure is known. Knowledge of a protein's structural class, such as its fold, assists in the determination of its 3D structure. If the structural class of a protein is known, it can be used to considerably reduce the search space of structure prediction, since most of the structural alternatives can be eliminated. The structure prediction task is thereby simplified and the whole process is accelerated [10].

There are many tools for protein tertiary structure prediction using hidden Markov models (HMMs) [3], [5], [6], [7], [8], [9], [10], [11]. Among sequence-based approaches, HMMs are predominant and also demonstrate high performance [5], [11]. HMMs were applied to multi-class protein fold recognition [6], employing the sequence alignment and modeling (SAM) software [5]. Secondary structure information was incorporated into HMMs [7]. Karchin et al. [9] used the same approach and evaluated different alphabets for backbone geometry. Bouchaffra et al. exploited the relations between the components of an entity (e.g. the secondary structures of a protein) and the whole [8]. Lampros et al. proposed a reduced state-space HMM, which simultaneously finds amino acid sequences and secondary structures for proteins [10]. However, the main disadvantage of HMMs is their reliance on large model architectures, which require large data sets and high computational resources for training. It is therefore necessary to reduce the number of HMM parameters while maintaining fold recognition performance. In addition, HMMs should take into account sequential information about the secondary structures of proteins to improve prediction performance.

Therefore, we propose a novel fold recognition method for protein tertiary structure prediction based on a hidden Markov model. The model consists of nine hidden states and uses protein sequences and secondary structure information to recognize the fold of proteins. Our contribution can be summarized as follows: first, we present the 9-state HMM suited to protein fold recognition. The nine states are divided into three components: α-helix, β-strand, and coil. Within each component, the three states represent the “beginning”, “middle”, and “ending” of the secondary structure. Second, we evaluate the proposed HMM on a low homology data set. Finally, we conjecture that the performance of protein fold recognition can be improved by using amino acid frequency and secondary structure information. We also avoid the computationally expensive Baum–Welch algorithm [21] and reduce the number of parameters of the HMM.
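The nine-state layout can be illustrated with a short sketch. The code below is not the authors' implementation: the state names (`H_b`, `E_m`, ...), the three-letter secondary structure alphabet `H`/`E`/`C`, the handling of runs shorter than three residues, and the add-one smoothing are all assumptions made for illustration. It labels each residue with one of nine states and then estimates transition probabilities by simple counting over labeled sequences, which is one way parameters can be obtained without Baum–Welch.

```python
from collections import Counter
from itertools import groupby

# Hypothetical state names: <component>_<position>, e.g. "H_b" = helix beginning.
COMPONENTS = ("H", "E", "C")  # alpha-helix, beta-strand, coil
POSITIONS = ("b", "m", "e")   # beginning, middle, ending
STATES = [f"{c}_{p}" for c in COMPONENTS for p in POSITIONS]


def label_nine_states(ss):
    """Map a secondary structure string over {H, E, C} to nine-state labels.

    Assumption: a run of length 1 is labeled "beginning" only, and a run of
    length 2 is labeled "beginning" then "ending". The source does not say
    how short runs are treated.
    """
    labels = []
    for symbol, run in groupby(ss):
        if symbol not in COMPONENTS:
            raise ValueError(f"unexpected secondary structure symbol: {symbol!r}")
        n = len(list(run))
        for i in range(n):
            if i == 0:
                pos = "b"
            elif i == n - 1:
                pos = "e"
            else:
                pos = "m"
            labels.append(f"{symbol}_{pos}")
    return labels


def estimate_transitions(label_seqs, pseudocount=1.0):
    """Estimate transition probabilities by counting adjacent label pairs.

    The pseudocount (an assumption) keeps unseen transitions non-zero.
    """
    counts = Counter()
    for labels in label_seqs:
        counts.update(zip(labels, labels[1:]))
    trans = {}
    for s in STATES:
        total = sum(counts[(s, t)] for t in STATES) + pseudocount * len(STATES)
        trans[s] = {t: (counts[(s, t)] + pseudocount) / total for t in STATES}
    return trans


labels = label_nine_states("CCHHHHEEC")
trans = estimate_transitions([labels])
print(labels)
print(round(trans["H_b"]["H_m"], 3))
```

Because the state path is read directly from the secondary structure annotation, each probability is a normalized count, so training takes a single pass over the data. Emission probabilities over the 20 amino acids could be counted in the same way.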

The rest of this paper is organized as follows: Section 2 introduces related work about protein structure prediction. In Section 3, we describe the proposed hidden Markov model. Section 4 shows a set of test data and several experimental results. Finally, Section 5 summarizes our conclusion and future work.

Related work

Hidden Markov models are a popular approach to machine learning in bioinformatics [12], [13], [14]. The theory of hidden Markov models was introduced by Baum and his colleagues in the 1960s, and it was widely applied to speech processing in the 1970s.

An HMM describes a probability distribution over a potentially infinite number of sequences. The HMM is a doubly embedded stochastic process with an underlying stochastic process that is not observable (it is

A 9-state hidden Markov model

In this section, we propose a 9-state hidden Markov model and describe the protein fold recognition method using the model.
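As a rough illustration of how HMMs are commonly used for fold recognition, the sketch below scores a sequence under one model per fold with the forward algorithm and returns the fold whose model gives the highest log-likelihood. This is a generic scheme and not necessarily the authors' procedure. The function names, the two-state toy models, and the two-letter alphabet are hypothetical; a model of the kind proposed here would have nine states and emit the 20 amino acids. The code assumes all probabilities are strictly positive (for example, after smoothing).

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(obs, states, start, trans, emit):
    """Return log P(obs | model) using the forward algorithm in log space.

    Assumes obs is non-empty and every probability is strictly positive.
    """
    alpha = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    for symbol in obs[1:]:
        alpha = {
            t: log_sum_exp([alpha[s] + math.log(trans[s][t]) for s in states])
            + math.log(emit[t][symbol])
            for t in states
        }
    return log_sum_exp(list(alpha.values()))


def recognize_fold(obs, fold_models):
    """Score obs under every fold model; return (best fold, all scores)."""
    scores = {
        fold: forward_log_likelihood(obs, *model)
        for fold, model in fold_models.items()
    }
    return max(scores, key=scores.get), scores


# Toy models (hypothetical): two states, alphabet {"A", "G"}.
toy_states = ("s1", "s2")
uniform_start = {"s1": 0.5, "s2": 0.5}
sticky = {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.1, "s2": 0.9}}
flat = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.5, "s2": 0.5}}
biased_emit = {"s1": {"A": 0.9, "G": 0.1}, "s2": {"A": 0.1, "G": 0.9}}
flat_emit = {"s1": {"A": 0.5, "G": 0.5}, "s2": {"A": 0.5, "G": 0.5}}

fold_models = {
    "fold_runs": (toy_states, uniform_start, sticky, biased_emit),
    "fold_uniform": (toy_states, uniform_start, flat, flat_emit),
}

best, scores = recognize_fold("AAAA", fold_models)
print(best, scores)
```

Working in log space avoids numerical underflow on long protein sequences, where the product of many small probabilities would otherwise round to zero.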

Simulation and results

In this section, we describe the simulation environment, the data sets, and the experiments used to evaluate the performance of the HMMs. The experiments were run on an Intel Pentium 4 2.8 GHz machine with 1 GB RAM and a 160 GB HDD, running Windows XP. We implemented the 9-state HMM in Java, using the Eclipse editor, and in MATLAB. The data set consists of training data and test data. For the purpose of our evaluation, all the proteins used in training and testing have known structures in PDB [16] and known class labels and fold labels in SCOP [17].

Conclusion

There are many tools for protein fold recognition using HMMs. However, the main disadvantage of HMMs is the employment of large-scale model architectures which require large data sets and high computational resources for training. Also, HMMs must consider sequential information about secondary structures of proteins, to improve prediction performance. Therefore, we proposed a novel method for protein fold recognition based on a hidden Markov model, called the 9-state HMM. The model consists of

Conflict of interest statement

None declared.

Acknowledgment

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (Ministry of Education, Science and Technology) (The Regional Core Research Program/Chungbuk BIT Research-Oriented University Consortium).

Sun Young Lee received the B.E. and M.E. degrees in Electrical Engineering from Chungbuk National University, South Korea in 2001 and 2005, respectively. She is currently a Ph.D. candidate in the Department of Computer Education at Chungbuk National University, South Korea. Her research interests include bioinformatics, databases, and ubiquitous computing.

References (21)

  • C.A. Orengo, D.T. Jones, J.M. Thornton, Bioinformatics: genes, proteins and computers, Advanced Text,...
  • A. Camproux, F. Guyon, R. Gautier, J. Laffray, P. Tufféry, A hidden Markov model applied to the analysis of protein...
  • A.D. Baxevanis, B.F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, third ed.,...
  • J.G. Lee, S.J. Oh, J.M. Kim, Developing Bioinformatics Computer Skills, Hanbit Media,...
  • R. Hughey et al., Hidden Markov models for sequence analysis: extension and analysis of the basic method, CABIOS (1996)
  • E. Lindahl et al., Identification of related proteins on family, superfamily and fold level, J. Mol. Biol. (2005)
  • J. Hargbo et al., Hidden Markov models that use predicted secondary structures for fold recognition, Proteins (1999)
  • D. Bouchaffra, J. Tan, Protein fold recognition using a structural hidden Markov model, in: 18th International...
  • R. Karchin et al., Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry, Proteins (2003)
  • C. Lampros, C. Papalukas, T.P. Exarchos, Y. Goletsis, D.I. Fotiadis, Sequence-based protein prediction using a reduced...

Jong-Yun Lee received the B.E. and M.E. degrees in Computer Engineering from Chungbuk National University in 1985 and 1987, respectively, and the Ph.D. degree in Computer Science from Chungbuk National University, South Korea in 1999. He worked as a research/project leader at the Software Research and Development Institute of Hyundai Electronics Industrial Company Ltd. and at Hyundai Information Technologies Company Ltd. in South Korea from 1990 to 1996. He also worked with Bit Computer Cooperation in 1989. He was an assistant professor in the Department of Information and Communication Engineering at Samcheok National University from March 1999 to February 2003. Since then, he has been an associate professor in the Department of Computer Education at Chungbuk National University in South Korea. His current research interests include temporal databases, spatio-temporal databases, u-learning, and bioinformatics, with a particular focus on query processing and optimization technologies in databases. He is a member of the IEEE, the Korea Information Science Society, and the Multimedia Society. He served as an editorial board member of the international journal INFORMATION in 2004 and as a journal editorial member of the Korea Information Processing Society from 2003 to 2006.

Kwang Su Jung received the Master's degree from Chungbuk National University, Republic of Korea. He is a Ph.D. candidate at Chungbuk National University, Republic of Korea. His research interests are data mining, protein structure, and bioinformatics.

Keun Ho Ryu received the Ph.D. degree from Yonsei University, Republic of Korea, in 1988. He is a professor at Chungbuk National University, Republic of Korea. His research interests include spatiotemporal databases, ubiquitous computing and stream data processing, data mining, and bioinformatics.
