An improved memetic approach for protein structure prediction incorporating maximal hydrophobic core estimation concept

https://doi.org/10.1016/j.knosys.2018.06.022

Abstract

Protein Structure Prediction (PSP) from the primary amino acid sequence, even using a simplified Hydrophobic-Polar (HP) lattice model, continues to be extremely challenging. Finding an optimal conformation, even for a small sequence, by any of the currently known evolutionary approaches is computationally expensive and time consuming. Although Memetic Algorithms (MAs) have shown success in finding the optimal solution for PSP, no significant work has been reported on incorporating domain- or problem-specific knowledge into the search process to improve their performance. In this paper, we present an approach to incorporate such knowledge into the initial population to enhance the effectiveness of MAs for PSP. The domain knowledge we propose to use is based on the concept of maximal ‘core’ formation, which exploits the fundamental tendency of the H residues to lie at the core of the minimum-energy optimal protein structure. A generic technique is proposed for estimating the maximal Hydrophobic core (H-core) in a protein sequence for the 2D Square, 3D Cubic and the more complex and realistic 3D FCC (Face Centered Cubic) lattice models. Subsequently, the knowledge of this estimated core is incorporated in an MA. The experiments conducted using HP benchmark sequences for the 2D Square, 3D Cubic and 3D FCC lattice models show that the proposed MA with the new core-based population initialization technique outperforms existing methods in terms of both convergence speed and minimal energy.

Introduction

Proteins are important biological macromolecules composed of a set of twenty amino acids, which participate in cellular metabolism to guarantee the proper functioning of all living organisms. A protein’s function is a consequence of its native structure, i.e., the unique three-dimensional structure having the lowest possible free energy [1]. Therefore, accurately predicting the structure of a protein has far-reaching implications in medical and biotechnological research as well as in drug design. According to the minimum energy hypothesis [2], the tertiary structure of a protein can be predicted from its amino-acid sequence. However, according to the Levinthal paradox [3], the time needed to reach the correctly folded structure by exhaustive enumeration grows exponentially with the number of residues. This means that finding the optimal conformation by exhaustive search, even for a small sequence, is extremely time-consuming. Due to the complex nature of the protein folding process, Protein Structure Prediction (PSP) is still considered a grand challenge in computational biology.

Finding a minimum energy conformation, even in a simplified model, has been proved to be NP-hard [1], [4]. The discrete search space of PSP is enormously large and complex, with very many peaks and troughs. In such problems, which are characterized by many local optima, traditional optimization techniques tend to fail in locating the global optimum [5]. This is because simply searching the neighborhood of a current solution is not sufficient to explore the huge search space and to direct the search towards global minima. Hence, search techniques such as Hill Climbing (HC), which use a single current solution to create new candidate solutions by exploring only its immediate neighbors, may not lead to the optimal solution. Moreover, many of these algorithms do not allow any backtracking or “downhill movement” before new fitness improvements occur, which may be required for escaping local optima [6]. Hence these algorithms have a tendency to stagnate on a local optimum, making them less efficient for tackling multi-modal optimization. Population-based methods that maintain several solutions simultaneously are particularly well suited for handling multi-modal problems, since they provide significant opportunity for exploring the large search space, as well as escaping local optima, by generating solutions that are not necessarily created as neighbors of existing solutions. Amongst various global search techniques, Genetic Algorithms (GAs) are efficient in exploring the search space to locate the promising region rapidly. Through the crossover operation, GAs induce a global interaction among the individuals, constructing a global solution by recombining good features of different individuals. This has a large impact on the effectiveness of the search, since it allows exploration of regions that are not accessible to either of the parent individuals [7].

However, exploration and exploitation are both important aspects in the design of any global search technique [8]. Exploration ensures that the search space has been explored sufficiently to provide a reliable estimate of the global optimum, whereas exploitation concentrates the search effort on the best solutions found thus far by searching their neighborhood to find better solutions [5]. The performance of any global optimization algorithm depends on the mechanism for balancing these two conflicting objectives [9].

Global search algorithms can rapidly locate the region in which the global optimum exists, though they lack the ability to locate the best solution in the best found region [10], [11] due to the operator’s inability to search the neighborhood of current solutions. Hence, it is now well-established that global search techniques are not well suited to fine-tuning the search in complex search spaces [12]. However, hybridization of these techniques with other local search approaches can greatly improve the efficiency of the search [13], [14].

Memetic Algorithms (MAs) combine the virtues of both local and global search algorithms and can be more robust than other approaches in dealing with the complex and challenging nature of PSP. MAs are population-based meta-heuristic approaches which may be regarded as a marriage between a population-based global search and local improvement procedures [5]. Such a hybrid of global and local search methods can accelerate convergence and increase the probability of approaching the global optimum.
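The hybrid structure described above can be sketched as a genetic loop in which every new individual is refined by a local improvement step. This is a minimal illustration on a toy bit-string problem, not the MA proposed in this paper: the objective (Hamming distance to a hidden `TARGET`), the operators, and all names and parameter values are assumptions chosen for brevity, and the objective is deliberately easy.

```python
import random

# Toy objective (an assumption for illustration): distance to a hidden bit string.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]

def fitness(bits):
    """Number of positions that differ from TARGET (lower is better)."""
    return sum(b != t for b, t in zip(bits, TARGET))

def random_individual(rng):
    return [rng.randint(0, 1) for _ in TARGET]

def crossover(a, b, rng):
    """One-point crossover: prefix of a, suffix of b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def local_search(bits):
    """Local improvement: one pass of first-improvement bit flips."""
    best = list(bits)
    for i in range(len(best)):
        candidate = best.copy()
        candidate[i] = 1 - candidate[i]
        if fitness(candidate) < fitness(best):
            best = candidate
    return best

def memetic_search(fitness, random_individual, crossover, mutate, local_search,
                   pop_size=20, generations=50, seed=0):
    """Global search (selection, crossover, mutation) plus local refinement."""
    rng = random.Random(seed)
    population = [local_search(random_individual(rng)) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            # Binary tournament selection of two parents.
            p1 = min(rng.sample(population, 2), key=fitness)
            p2 = min(rng.sample(population, 2), key=fitness)
            child = mutate(crossover(p1, p2, rng), rng)
            offspring.append(local_search(child))  # the memetic step
        # Elitist replacement: keep the best pop_size individuals.
        population = sorted(population + offspring, key=fitness)[:pop_size]
    return population[0]

best = memetic_search(fitness, random_individual, crossover, mutate, local_search)
```

Removing the two `local_search` calls turns the sketch back into a plain GA, which makes the division of labour between global exploration and local exploitation easy to see.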

MAs are intrinsically concerned with exploiting the available knowledge of the problem domain. Different studies [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] have demonstrated that, by incorporating problem-specific prior knowledge in an MA framework, a memetic model exhibits better performance than a global search technique. The memes employed have a major impact on the search performance of MAs, as has been shown extensively in [15], [16], [17]. The idea of employing different memes together was introduced by Cowling et al. [16], who applied different memes to each individual. However, it may be noted that the choice of memes affects the performance of MAs significantly [15]. Ong and Keane [22] coined the term “meta-Lamarckian learning” to introduce the idea of adaptively choosing among multiple memes during an MA search. The terms “meta-Lamarckian learning”, “hyperheuristic”, and “multi-memes” have been used interchangeably when referring to the adaptation of memes in adaptive MAs.

In adaptive MAs, multiple memes are used in the search and the decision regarding which meme to apply is made dynamically. Different strategies have been employed to select a meme from a pool of memes for each individual to conduct the local improvements [5]. For example, in the random strategy Simplerandom, a meme is selected stochastically, with the probability of choosing each meme kept constant. Randomdescent is another random strategy, in which a meme is initially chosen at random and applied repeatedly until no further local improvement occurs; the same process is then repeated for all the other memes. In the greedy strategy [16], on the other hand, every meme is tried on each individual, and the meme that yields the biggest improvement is chosen.
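The three selection strategies described above can be sketched as follows. This is an illustrative reading of the descriptions in the text, not code from the cited works: a meme is modelled as a function from an individual to an individual, fitness is minimized, and the function names, the uniform choice in `simple_random`, the `max_steps` cap, and the integer toy problem are all assumptions.

```python
import random

def simple_random(individual, memes, fitness, rng):
    """Pick one meme with a fixed (here uniform) probability and apply it once."""
    return rng.choice(memes)(individual)

def random_descent(individual, memes, fitness, rng, max_steps=1000):
    """Visit the memes in random order; reapply each one while it keeps improving."""
    order = list(memes)
    rng.shuffle(order)
    best = individual
    for meme in order:
        for _ in range(max_steps):  # safety cap, an assumption of this sketch
            candidate = meme(best)
            if fitness(candidate) < fitness(best):
                best = candidate
            else:
                break
    return best

def greedy(individual, memes, fitness, rng):
    """Try every meme on the individual and keep the largest improvement."""
    best = min((meme(individual) for meme in memes), key=fitness)
    return best if fitness(best) < fitness(individual) else individual

# Toy problem (hypothetical): individuals are integers, the optimum is 10.
def toy_fitness(x):
    return abs(x - 10)

toy_memes = [lambda x: x + 1, lambda x: x - 1, lambda x: x + 3]

rng = random.Random(0)
after_greedy = greedy(0, toy_memes, toy_fitness, rng)
after_descent = random_descent(0, toy_memes, toy_fitness, rng)
after_random = simple_random(0, toy_memes, toy_fitness, rng)
```

In this sketch `simple_random` returns the result of the chosen meme even when it is worse; whether such a result is accepted is left to the surrounding MA. The greedy strategy costs one meme application per meme for every individual, which is the price it pays for always taking the best available move.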

The performance of an optimization technique can be improved by incorporating knowledge from the problem domain [30], [31]. It has also been postulated [32] that seeding the initial population can increase the likelihood of reaching the global solution through the iterative process of information exchange and can favorably bias the search towards optimal or near-optimal solutions [33]. Hence, incorporating prior knowledge in the initial population can have a significant impact on the performance of evolutionary algorithms. However, most techniques for generating the initial population for PSP focus mainly on the Self-Avoiding-Walk (SAW) constraint, along with a diversity constraint. It is therefore important to incorporate domain knowledge in the initial population, so that the search commences with good seeds and explores promising regions. In this paper, we propose a method to incorporate into an MA the knowledge of the maximal possible hydrophobic core that can be formed in the optimal conformation of a protein sequence. The technique extracts the knowledge of the maximal core by analyzing different aspects of the protein sequence. Furthermore, a novel knowledge-based initial population generation technique is proposed to incorporate the knowledge of the core into the MA. The accuracy of the extracted knowledge is then assessed by designing a deterministic search technique based on the core approximated by the core estimation technique. This paper also includes an extensive analysis of the aspects of a protein sequence that are instrumental in approximating the size of the maximal possible hydrophobic core in its optimal conformation for different lattice models. The work extends our preliminary results on the maximal core estimation technique for the 2D Square and 3D Cubic lattice models, reported in [34], [35], to the 3D FCC lattice model. Based on various traits observed for the different lattice models, a generalized technique for estimating the maximal hydrophobic core is proposed for generating an initial population.
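For contrast with the knowledge-based seeding proposed here, the conventional kind of initialization mentioned above, which enforces only the SAW and diversity constraints, might look like the following sketch on a 2D square lattice. It is a generic baseline and not the core-based technique of this paper; the function names, the restart policy, and the use of distinctness as the diversity criterion are assumptions.

```python
import random

# Unit steps on the 2D square lattice.
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def random_saw(n, rng, max_restarts=10_000):
    """Grow a self-avoiding walk of n residues, restarting if it traps itself."""
    for _ in range(max_restarts):
        walk = [(0, 0)]
        occupied = {(0, 0)}
        while len(walk) < n:
            x, y = walk[-1]
            free = [(x + dx, y + dy) for dx, dy in MOVES
                    if (x + dx, y + dy) not in occupied]
            if not free:  # dead end: every neighbouring site is taken
                break
            nxt = rng.choice(free)
            walk.append(nxt)
            occupied.add(nxt)
        if len(walk) == n:
            return walk
    raise RuntimeError("could not generate a self-avoiding walk")

def initial_population(n, size, seed=0):
    """Collect `size` distinct self-avoiding walks (a simple diversity constraint)."""
    rng = random.Random(seed)
    population, seen = [], set()
    for _ in range(100 * size):  # attempt cap, an assumption of this sketch
        walk = tuple(random_saw(n, rng))
        if walk not in seen:
            seen.add(walk)
            population.append(list(walk))
        if len(population) == size:
            return population
    raise RuntimeError("could not reach the requested population size")

population = initial_population(n=20, size=10, seed=1)
```

Nothing in this baseline looks at where the H residues fall in the sequence, which is the gap that seeding with problem-specific knowledge is meant to close.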

The remainder of the paper is organized as follows. In Section 2, we present the preliminaries on different lattice and PSP models, along with various initial population generation techniques. The importance of estimating the size of a maximal H-core is also highlighted in this section. Section 3 describes the proposed MA in three parts: (i) illustrating maximal core estimation techniques for the 2D Square, 3D Cubic and 3D FCC lattice models, (ii) proposing a generic H-core estimation algorithm which is capable of dealing with all three lattice models, and (iii) proposing a technique to incorporate the knowledge of the maximal H-core in initial population generation. In Section 4, the effects of the incorporated knowledge in the initial population as well as in the MA are assessed and discussed with the aid of widely used benchmark sequences. Finally, Section 5 concludes the paper.

Section snippets

Background

This section provides a brief introduction to the HP protein model, methods for H-core estimation, and initial population generation in MAs applied to PSP.
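As a concrete illustration of the HP model on a 2D square lattice, the sketch below scores a conformation with the standard HP energy: each pair of H residues that occupy adjacent lattice sites without being consecutive in the chain contributes -1. The move encoding (`U`, `D`, `L`, `R`) and the function names are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

# Absolute move encoding (an assumption of this sketch).
STEP: Dict[str, Coord] = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def decode(moves: str) -> List[Coord]:
    """Convert a move string into lattice coordinates, starting at the origin."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = STEP[m]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords

def is_self_avoiding(coords: List[Coord]) -> bool:
    """A valid conformation never visits a lattice site twice."""
    return len(set(coords)) == len(coords)

def hp_energy(sequence: str, coords: List[Coord]) -> int:
    """Return -1 times the number of non-consecutive H-H lattice contacts."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and conformation lengths differ")
    if not is_self_avoiding(coords):
        raise ValueError("conformation is not a self-avoiding walk")
    index_at = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look right and up only, so that each contact is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy

# A 4-residue chain folded into a unit square brings its two end residues together.
square = decode("RUL")
energy_square = hp_energy("HPPH", square)
energy_line = hp_energy("HPPH", decode("RRR"))
```

Minimizing this energy is equivalent to maximizing the number of H-H contacts, which is why compact arrangements of the H residues, i.e. an H-core, are of interest.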

The method

We describe our proposed approach by introducing the concept of maximal H-core formation for three different lattice models, followed by the technique for incorporating the H-core into the initial population of the MA.

Experimental results and discussions

The proposed method for incorporating domain knowledge is compared with a popular state-of-the-art algorithm on 2D and 3D HP lattice models. The method is implemented in C++ and executed on the Monash Sun Grid (MSG) [43]. Experiments have been carried out on a data set of 11 benchmark protein sequences using the three lattice models considered (i.e., 2D Square, 3D Cubic, and 3D FCC). Since the sequences with a small number of Hs are not affected significantly by the CIPG technique, we have

Conclusion

Incorporation of prior knowledge in any evolutionary algorithm, especially for predicting the structure of proteins, plays a crucial role in improving its performance. While there has been some effort towards incorporating domain-specific knowledge in these algorithms within computational biology, very few methods have done so with biologically relevant a priori knowledge. In this paper, we have proposed a systematic mechanism to incorporate the knowledge of maximal Hydrophobic-core or

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (47)

  • Unger R. et al. Genetic algorithms for protein folding simulations. J. Mol. Biol. (1993)
  • Preux P. et al. Towards hybrid evolutionary algorithms. Int. Trans. Oper. Res. (1999)
  • Berger B. et al. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. Comput. Biol. (1998)
  • Anfinsen C.B. Principles that govern the folding of protein chains. Science (1973)
  • Levinthal C. Are there pathways for protein folding? J. Chem. Phys. (1968)
  • Crescenzi P. et al. On the complexity of protein folding
  • Ong Y.-S. et al. Classification of adaptive memetic algorithms: a comparative study. IEEE Trans. Syst. Man Cybern. B (2006)
  • Ursem R.K. Models for Evolutionary Algorithms and Their Applications in System Identification and Control Optimization, DS-03-6 (2003)
  • Torn A. et al. Global Optimization (1989)
  • El-Mihoub T.A. et al. Hybrid genetic algorithms: A review. Eng. Lett. (2006)
  • De Jong K.A. Genetic algorithms: a 30 year perspective
  • Lozano M. et al. Real-coded memetic algorithms with crossover hill-climbing. Evol. Comput. (2004)
  • Davis L. Handbook of Genetic Algorithms, vol. 115 (1991)
  • Goldberg D.E. et al. Optimizing global-local search hybrids
  • Hart W. Adaptive Global Optimization With Local Search (1994)
  • P. Cowling, G. Kendall, E. Soubeiga, A hyperheuristic approach to scheduling a sales summit, in: Practice and Theory of...
  • G. Kendall, P. Cowling, E. Soubeiga, Choice function and random hyperheuristics, in: Asia-Pacific Conference on...
  • N. Krasnogor, B. Blackburne, E.K. Burke, J.D. Hirst, Multimeme algorithms for protein structure prediction, in:...
  • Krasnogor N. Studies on the Theory and Design Space of Memetic Algorithms (2002)
  • J. Smith, Co-evolving memetic algorithms: Initial investigations, International Conference on Parallel Problem Solving...
  • Ong Y.S. Artificial Intelligence Technologies in Complex Engineering Design (2002)
  • Ong Y.S. et al. Meta-lamarckian learning in memetic algorithms. IEEE TEC (2004)
  • Wong K. et al. Using memetic algorithms for fuzzy modeling. Aust. J. Intell. Inf. Process (2004)