1 Introduction

In recent decades, a variety of algorithmic approaches has been introduced for solving combinatorial optimization problems, with the aim of finding optimal or at least satisfactory solutions. Generally, we distinguish between exact approaches, which derive provably optimal solutions but may require prohibitive computation times, and heuristic approaches, which provide sub-optimal solutions within more practical computation time limits. Especially when large-scale problem instances must be solved, heuristic or metaheuristic search strategies are often used instead of exact techniques. Metaheuristic algorithms include methods such as evolutionary algorithms, ant colony optimization, and tabu search, to name just a few. Depending on the type of optimization problem, a whole range of different techniques can be used in the process of finding a good enough solution. Nowadays, many of these approaches take advantage of the joint use of exact and heuristic techniques. Such approaches are often called hybrid metaheuristics [1, 2] or matheuristics [3].

The hybrid metaheuristic framework Cmsa (Construct, Merge, Solve & Adapt) was introduced by [4] for solving combinatorial optimization problems. The concept behind Cmsa is to profit from the power of exact solvers even when applied to large problem instances. In particular, Cmsa solves—at each iteration—a sub-instance of the original problem instance. These sub-instances are generated by merging different solutions constructed during the algorithm run. Moreover, sub-instances are usually expressed in terms of integer linear programming (ILP) models and solved by black-box ILP solvers such as CPLEX or Gurobi. Applications of Cmsa have shown remarkably good results for a number of interesting problems. Among these, we find, for example, minimum common string partition [5], a problem from the field of bioinformatics. Other examples include project scheduling [6], the maximum happy vertices problem [7], and the problem of test case generation for software checking [8].

The present proposal focuses on the development of a new variant of Cmsa based on the hybridization with a population-based metaheuristic, with the goal of adding a learning component to standard Cmsa. The usefulness of this proposal will be shown by the application to the so-called far from most string problem (FFMSP) [9], which is an NP-hard combinatorial optimization problem. This optimization problem forms part of a category of string-related problems known as sequence consensus problems. In these problems, a finite set of sequences is provided, and the goal is to identify their consensus—namely, a novel sequence that effectively encapsulates the essence of all the provided sequences. Various objectives, some potentially conflicting, can be contemplated within the realm of sequence consensus problems. Examples of such objectives are outlined in the following [10].

  • Closest string problem (CSP): Find a sequence whose total distance from all input sequences is minimal.

  • Farthest string problem (FSP): Find a sequence that maximizes its overall distance from all input sequences.

  • Far from most string problem (FFMSP): Find a sequence far from most of the input sequences, where "most" is defined by a fixed threshold.

As mentioned above, to solve the FFMSP problem, we develop a new Cmsa variant that incorporates learning into the solution construction process. In the case of this hybrid approach, the learning component will result from the hybridization with a bacterial algorithm (BA), which is an evolutionary algorithm in which the crossover operator is inspired by processes observed in bacteria. Note that any population-based approach could have been used to add learning to Cmsa. Moreover, note that the BA algorithm will be described in metaphor-free language in this work.

In general, BAs are inspired by the survival of a bacterial population, and by the way in which a population of bacteria evolves and develops a resistance to antibiotics. Note that when bacteria develop such resistance, they are no longer affected by exposure to antibiotics. From the viewpoint of bacteria, developing such a resistance is, of course, desirable. From a medical point of view, however, this may result in severe problems for humans. This is due to the fact that infections triggered by resistant microorganisms frequently do not respond to standard treatments, leading to prolonged illness and an increased likelihood of mortality. In simpler terms, antibiotic resistance refers to a form of drug resistance wherein a microorganism can withstand exposure to an antibiotic [11]. Recent research on exploiting this bacterial behavior has already led to an application to a group formation problem in the context of designing student activities, where the core idea is that students collaborate in groups to improve their academic performance [12].

1.1 Contribution and Paper Outline

In this work, we present an algorithmic proposal that belongs to the category of hybrid algorithms based on problem instance reduction [1]. In particular, we augment the Cmsa algorithm with a learning mechanism by combining it with a population-based metaheuristic. The proposed hybrid algorithm, called \(\textsc {Learn}\_\textsc {Cmsa}\), is applied to the notoriously difficult far from most string problem. In the section on the experimental evaluation, we not only compare \(\textsc {Learn}\_\textsc {Cmsa}\) to its components (Cmsa and the bacterial algorithm), but we also compare to the current state of the art for the far from most string problem. Our results show that \(\textsc {Learn}\_\textsc {Cmsa}\) outperforms its algorithmic components and that it compares very favorably with the current state of the art.

The remainder of this paper is organized as follows. Section 2 introduces the far from most string problem. Section 3 first introduces the pure variants of both Cmsa and the bacterial algorithm, before discussing the hybridization of Cmsa with the bacterial algorithm. Section 4 outlines the experimental design and the obtained results, and finally, Sect. 5 offers conclusions derived from this research and an outlook to future work.

2 The Far from Most String Problem

A problem instance of the far from most string problem (FFMSP) is represented as \((\Omega , t)\), where set \(\Omega = \{s_1, \ldots , s_n\}\) contains n input strings over a finite alphabet \(\Sigma \). Each input string \(s_i\) in \(\Omega \) has a length of m, i.e., \(|s_i|=m\) for all \(s_i \in \Omega \). Additionally, a fixed threshold value \(0< t < m\) is provided. In subsequent discussions, \(s_i[j]\) denotes the jth character of a string \(s_i\). The Hamming distance between two strings \(s_i \ne s_j \in \Omega \) of equal length, expressed as \(d_H(s_i, s_j)\), is defined as the count of positions where corresponding characters in the two strings differ. In other words:

$$\begin{aligned} d_H(s_i, s_j) = \left| \left\{ k \in \{1,\ldots ,m\} \mid s_i[k] \not = s_j[k]\right\} \right| . \end{aligned}$$
(1)

Note that all possible strings of length m over alphabet \(\Sigma \) are valid FFMSP solutions. Given such a string s, its objective function value \(f_{\textrm{orig}}(s)\) is computed as follows:

$$\begin{aligned} f_{\textrm{orig}}(s):= |\{ s_i \in \Omega \mid d_H(s, s_i) \ge t\}|. \end{aligned}$$
(2)

This implies that the objective function value of a solution or string s is determined by the count of input strings for which the Hamming distance with s is greater than or equal to the threshold value t.
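To make definitions (1) and (2) concrete, the following C++ sketch (our own illustrative code, not the authors' implementation; all names are ours) computes the Hamming distance and the original objective function:

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// d_H(s1, s2) of Eq. (1): number of positions at which the two strings differ.
int hammingDistance(const std::string& s1, const std::string& s2) {
    assert(s1.size() == s2.size());
    int d = 0;
    for (std::size_t k = 0; k < s1.size(); ++k)
        if (s1[k] != s2[k]) ++d;
    return d;
}

// f_orig(s) of Eq. (2): number of input strings at Hamming distance >= t from s.
int fOrig(const std::string& s, const std::vector<std::string>& omega, int t) {
    int count = 0;
    for (const auto& si : omega)
        if (hammingDistance(s, si) >= t) ++count;
    return count;
}

int main() {
    std::vector<std::string> omega = {"ACGT", "AAAA", "TTTT"};
    // Distances from "GGGG": 3, 4, 4; with t = 3 all three inputs are "far".
    std::cout << fOrig("GGGG", omega, 3) << '\n';  // prints 3
}
```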

In addition to the technical problem description, we provide an integer linear programming (ILP) model of the FFMSP. This is because Cmsa and \(\textsc {Learn}\_\textsc {Cmsa}\) internally use an ILP solver for the purpose of solving sub-instances of the considered problem instance. In particular, we provide below the description of the ILP model from [10]. This model makes use of two sets of binary variables. The first one of these sets comprises a variable \(x_{j,a}\) for every combination of a position \(j = 1, \ldots , m\) in a potential solution and a character \(a \in \Sigma \). The second set encompasses binary variables \(y_i\) corresponding to each input string \(s_i \in \Omega \) (\(i = 1, \ldots , n\)). The ILP model can be formulated as follows.

$$\begin{aligned}&{\textbf { max}} \sum _{i=1}^n y_i \end{aligned}$$
(3)
$$\begin{aligned}&{{\textbf { subject to:}}} \nonumber \\&\sum _{a \in \Sigma } x_{j,a} = 1 \;\;\; \forall \; j \in \{1, \ldots , m\} \end{aligned}$$
(4)
$$\begin{aligned}&\sum _{j=1}^m x_{j,s_i[j]} \le m - t \cdot y_i \;\;\; \forall \; i \in \{1, \ldots , n\} \nonumber \\&x_{j,a}, y_i \in \{0, 1\} \end{aligned}$$
(5)

Observe that constraints (4) guarantee that each position j in a solution string is occupied by exactly one letter from \(\Sigma \). Furthermore, constraints (5) force a variable \(y_i\) to take the value zero in case the Hamming distance of the solution string (defined by the x-variables) to input string \(s_i \in \Omega \) is below threshold t. To see this, note that \(\sum _{j=1}^m x_{j,s_i[j]} = m - d_H(s, s_i)\) for the solution string s encoded by the x-variables, so that constraint (5) is equivalent to \(d_H(s, s_i) \ge t \cdot y_i\).
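As a hedged illustration of how model (3)–(5) might be assembled in practice, the following sketch uses the Concert Technology C++ API of CPLEX, the solver employed in Sect. 4. The function and variable names are our own, and error handling is omitted:

```cpp
// A sketch (under our own naming) of building the ILP (3)-(5) with CPLEX.
#include <ilcplex/ilocplex.h>
#include <string>
#include <vector>

IloCplex buildModel(IloEnv env, const std::vector<std::string>& omega,
                    const std::string& sigma, int t) {
    const int n = omega.size(), m = omega[0].size();
    IloModel model(env);

    // x[j][a] = 1 iff position j of the solution string holds letter sigma[a].
    IloArray<IloBoolVarArray> x(env, m);
    for (int j = 0; j < m; ++j) x[j] = IloBoolVarArray(env, sigma.size());

    // y[i] = 1 iff input string s_i is at Hamming distance >= t from the solution.
    IloBoolVarArray y(env, n);

    // Objective (3): maximize the number of "far" input strings.
    IloExpr obj(env);
    for (int i = 0; i < n; ++i) obj += y[i];
    model.add(IloMaximize(env, obj));

    // Constraints (4): exactly one letter per position.
    for (int j = 0; j < m; ++j) {
        IloExpr one(env);
        for (std::size_t a = 0; a < sigma.size(); ++a) one += x[j][a];
        model.add(one == 1);
    }

    // Constraints (5): matches with s_i force y_i = 0 when d_H(., s_i) < t.
    for (int i = 0; i < n; ++i) {
        IloExpr matches(env);
        for (int j = 0; j < m; ++j)
            matches += x[j][sigma.find(omega[i][j])];
        model.add(matches <= m - t * y[i]);
    }
    return IloCplex(model);
}
```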

2.1 Existing Work on the FFMSP

The FFMSP can be regarded as a well-studied problem. In 2003, for example, [13] proved that approximating the FFMSP within a polynomial factor is NP-hard for sequences over an alphabet \(\Sigma \) with \(\vert \Sigma \vert \ge 3\). Given the computational complexity of the problem, the research community has primarily concentrated on heuristic and metaheuristic approaches. The initial proposal consisted of a greedy heuristic with a subsequent application of local search [14]. In fact, approaches based on randomized solution construction—in particular, metaheuristics known as greedy randomized adaptive search procedures (GRASP)—have enjoyed popularity for the application to the FFMSP; see [15,16,17,18,19]. The newest of these GRASP approaches [19] is, in fact, a hybrid technique that combines GRASP with both variable neighborhood search (VNS) and path relinking.

Apart from the GRASP proposals, the literature on the FFMSP also offers evolutionary algorithms (EAs) such as the one from [16]. This algorithm features a diversity preservation mechanism to augment its exploration ability. The second EA proposal from [20] is a memetic algorithm that makes use of local search to improve the generated solutions. This algorithm proposal profits from earlier work using GRASP mechanisms for the construction of solutions. Ant colony optimization (ACO) is another bioinspired metaheuristic that has already been applied twice to the FFMSP. The first application from [10] hybridizes an ACO algorithm with the application of the ILP solver CPLEX for a possible improvement of the output of ACO. In contrast, in [21] the authors propose the application of a very recent ACO variant, known as negative learning ACO, to the FFMSP. Finally, for the sake of completeness, it is worth mentioning the beam search approach from [9]. Currently, the negative learning ACO approach from [21] and the memetic algorithm from [20] can be regarded as the state-of-the-art approaches for solving the FFMSP.

2.2 Augmented Objective Functions

The FFMSP poses a significant challenge not only for exact techniques but also for metaheuristics, primarily due to the limited range of possible objective function values. In fact, for an instance with n input strings the set of possible objective function values is \(\{0,\ldots ,n\}\). Because of this, the search space of an FFMSP problem instance encompasses broad plateaus, leading to situations where similar solutions frequently share identical objective function values. For a metaheuristic, this means that the search space often offers little (or no) guidance regarding the question of how to keep moving and exploring during the search process. As a result, metaheuristics often get trapped on such plateaus. In light of the previously discussed reasons, [21] tested four different augmented objective functions in addition to the original objective function. In this work, we will consider the most successful options. It is crucial to note, though, that these alternative functions are employed only for the solution evaluations/comparisons occurring within the Cmsa variants and the bacterial algorithm. CPLEX, which is utilized in Cmsa and \(\textsc {Learn}\_\textsc {Cmsa}\) for solving sub-instances at every iteration, continues to rely on the original objective function.

[17] proposed the first alternative objective function, which is denoted as \(f_{\textrm{mou}}()\). The purpose of this function is to generate a search landscape with fewer plateaus and a decreased number of local optima. It evaluates a solution by considering the likelihood of it leading to improved solutions through a relatively small number of local search moves. It is important to note that if \(f_{\textrm{orig}}(s) > f_{\textrm{orig}}(s')\) for two valid solutions s and \(s'\), then \(f_{\textrm{mou}}(s) > f_{\textrm{mou}}(s')\) also holds. Consequently, \(f_{\textrm{mou}}()\) can be used on its own as the sole guiding function, since it never contradicts the original objective function. Due to the complexity of \(f_{\textrm{mou}}()\) and the limited space available for its description, we direct interested readers to the original publication [17] or to [20], where the authors implemented \(f_{\textrm{mou}}()\) in the context of their memetic algorithm.

[10] introduced a second alternative objective function, which we will refer to as \(f_{\textrm{blu}}()\). This function is a lexicographic objective function that primarily employs the original objective function as its first criterion. The second criterion utilizes the following function:

$$\begin{aligned} h(s):= \sum _{\{s_i \in \Omega \mid d_H(s, s_i) \ge t\}} d_H(s, s_i) + \max _{\{s_i \in \Omega \mid d_H(s, s_i) < t\}} \{d_H(s, s_i)\}. \end{aligned}$$
(6)

In more straightforward language, h(s) adds up the Hamming distances between s and the input strings \(s_i \in \Omega \) in those cases in which the Hamming distance is at least t. It also takes into account the maximum Hamming distance between s and the input strings \(s_i \in \Omega \) where the Hamming distance is less than t. The original objective function and h() are then integrated using a lexicographic approach:

$$\begin{aligned} f_{\textrm{blu}}(s) > f_{\textrm{blu}}(s') \;\; {\textbf {iff}} \;\; f_{\textrm{orig}}(s) > f_{\textrm{orig}}(s') \;\; {\textbf {or}} \;\; \left( f_{\textrm{orig}}(s) = f_{\textrm{orig}}(s') \; {\textbf {and}} \; h(s) > h(s')\right) . \end{aligned}$$
(7)

For valid solutions s and \(s'\) to the problem, the rationale behind h() can be described as follows: a higher value of h(s) corresponds to a lower likelihood that minor modifications in s will result in a reduction of the original objective function. In addition to the two functions mentioned earlier, we also examine a simplified variant of \(f_{\textrm{blu}}()\) that employs function \(h'()\) rather than h():

$$\begin{aligned} h'(s):= \max _{\{s_i \in \Omega \mid d_H(s, s_i) < t\}} \{d_H(s, s_i)\}. \end{aligned}$$
(8)

It is important to note that \(h'(s)\), unlike h(s), focuses solely on the maximum Hamming distance between s and the input strings \(s_i \in \Omega \) where the Hamming distance is less than t. The resultant simplified lexicographic function is from now on denoted as \(f_{\textrm{sim}}()\).
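The three lexicographic ingredients can be summarized in code. The following sketch is our own illustration; it reuses the hammingDistance() and fOrig() helpers sketched in Sect. 2 and treats an empty maximum in (6) and (8), i.e., the case in which no input string is closer than t, as zero:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Declarations of the helpers sketched in Sect. 2.
int hammingDistance(const std::string& s1, const std::string& s2);
int fOrig(const std::string& s, const std::vector<std::string>& omega, int t);

// h(s) of Eq. (6): sum of the distances >= t, plus the largest distance < t.
int h(const std::string& s, const std::vector<std::string>& omega, int t) {
    int sumFar = 0, maxNear = 0;
    for (const auto& si : omega) {
        int d = hammingDistance(s, si);
        if (d >= t) sumFar += d;
        else maxNear = std::max(maxNear, d);
    }
    return sumFar + maxNear;
}

// h'(s) of Eq. (8): only the largest distance below the threshold t.
int hPrime(const std::string& s, const std::vector<std::string>& omega, int t) {
    int maxNear = 0;
    for (const auto& si : omega) {
        int d = hammingDistance(s, si);
        if (d < t) maxNear = std::max(maxNear, d);
    }
    return maxNear;
}

// f_blu comparison (7): true iff s is strictly better than sPrime.
bool betterBlu(const std::string& s, const std::string& sPrime,
               const std::vector<std::string>& omega, int t) {
    int f1 = fOrig(s, omega, t), f2 = fOrig(sPrime, omega, t);
    return f1 > f2 || (f1 == f2 && h(s, omega, t) > h(sPrime, omega, t));
}
```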

Lastly, [21] also considered a combined objective function that merges three of the criteria presented above.

3 The Proposed Algorithms

In the following, we first describe the pure variants of Cmsa and the population-based metaheuristic, before the developed hybrid technique (\(\textsc {Learn}\_\textsc {Cmsa}\)) is presented.

3.1 Cmsa for the FFMSP

Algorithm 1: Cmsa for the FFMSP

Cmsa algorithms assemble solutions from a finite set C of so-called solution components. In the case of our Cmsa algorithm for the FFMSP, each combination of a position j in the solution string (where \(j=1,\ldots ,m\)) and a letter \(a \in \Sigma \) is a solution component \(c_{j,a}\). That is, \(C:= \{c_{j,a} \mid j=1,\ldots ,m \hbox { and } a \in \Sigma \}\). Any feasible solution S is a subset of C such that for each position \(j=1,\ldots ,m\), S contains exactly one of the solution components from \(C_j:= \{c_{j,a} \mid a \in \Sigma \}\). Similarly, the sub-instance \(C'\) is always a subset of C. Note that, in the following, \(f_{\textrm{orig}}(S):= f_{\textrm{orig}}(s)\), where s is the solution string which is derived in a well-defined way from the solution components in S. Moreover, let \(f_{\textrm{orig}}(\emptyset ):= 0\).
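For illustration, one possible way of representing these notions in C++ (our own choice of data layout, using 0-based positions) is the following:

```cpp
// c_{j,a} is the pair (position j, letter a); a solution S holds one component
// per position, and its string form is recovered by writing each letter at its
// position.
#include <string>
#include <utility>
#include <vector>

using Component = std::pair<int, char>;    // c_{j,a} = (position, letter)
using Solution  = std::vector<Component>;  // one component per position

std::string toStringForm(const Solution& S, int m) {
    std::string s(m, '?');
    for (const auto& [j, a] : S) s[j] = a; // positions assumed 0-based here
    return s;
}
```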

At the start of the Cmsa algorithm, the best-so-far solution (\(S_{\textrm{bsf}}\)) and the sub-instance (\(C'\)) are initialized to the empty set; see lines 3 and 4. Moreover, the so-called age value \(age[c_{j,a}]\) of each solution component \(c_{j,a} \in C\) is initialized to zero (see line 5). Then, at each iteration, \(n_{\textrm{a}}\) valid solutions are probabilistically generated in the construct-step of Cmsa; see line 8. This is done in function ConstructSolution(), which is described below. After constructing a solution S, all those solution components from S that do not yet form part of the sub-instance \(C'\) are added to \(C'\), and their age values are set to zero; see lines 9–12. This step of Cmsa is known as the merge-step. Next, in the solve-step of Cmsa, an ILP solver (in our case, CPLEX) is applied to sub-instance \(C'\) with a computation time limit of \(t_{\textrm{solver}}\) CPU seconds. The ILP model corresponding to sub-instance \(C'\) is obtained by adding the following constraints to the ILP model from Sect. 2:

$$\begin{aligned} x_{j,a} = 0 \quad \forall c_{j,a} \in C \setminus C'. \end{aligned}$$
(9)

In other words, position-letter combinations that are not found in \(C'\)—that is, position-letter combinations whose corresponding solution component does not form part of \(C'\)—are not allowed in the ILP model. The best solution returned by CPLEX within the given computation time limit is henceforth called \(S^{\prime }_{\textrm{opt}}\). As a last step, the adapt-step of Cmsa is carried out in function Adapt(\(C'\), \(S^{\prime }_{\textrm{opt}}\), \(\text {age}_{\textrm{max}}\)). This function implements a mechanism to eliminate seemingly irrelevant solution components from the sub-instance \(C'\) at each algorithm iteration. Specifically, it involves three steps: first, incrementing the age values of all solution components in \(C'\); second, resetting the age values of all solution components in \(S^{\prime }_{\textrm{opt}}\) to zero; and third, removing the solution components with age values exceeding \(\text {age}_{\textrm{max}}\) from \(C'\). The output of Cmsa, after reaching the overall computation time limit, is solution \(S_{\textrm{bsf}}\) (respectively its string variant).
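A minimal sketch of this adapt-step, assuming the sub-instance \(C'\) is stored as a map from solution components to age values (a data layout of our choosing, not necessarily the authors'), could look as follows:

```cpp
#include <iterator>
#include <map>
#include <utility>
#include <vector>

using Component = std::pair<int, char>;   // c_{j,a} = (position, letter)
using Solution  = std::vector<Component>;

void adapt(std::map<Component, int>& subInstance,  // C' with age values
           const Solution& sOpt,                   // S'_opt from the ILP solver
           int ageMax) {
    for (auto& [component, age] : subInstance) ++age;  // 1) age all components
    for (const auto& c : sOpt) subInstance[c] = 0;     // 2) reset ages in S'_opt
    for (auto it = subInstance.begin(); it != subInstance.end(); )  // 3) drop old
        it = (it->second > ageMax) ? subInstance.erase(it) : std::next(it);
}
```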

3.2 Probabilistic Solution Construction for FFMSP

The Cmsa algorithm outlined above requires a way to probabilistically generate valid solutions (in function ConstructSolution() of Algorithm 1). For this purpose, we implemented the following solution construction procedure which is pseudo-coded in Algorithm 2. Note that this is a variant of the heuristic proposed by [14]. The construction of a solution starts by determining the first position j of the solution string to which we will assign a letter. This is done in function DetermineStartingPosition() of line 4 as follows. Let \(\hbox {occ}(i, a)\) be the number of occurrences of letter \(a \in \Sigma \) at position i of all input strings, that is:

$$\begin{aligned} \hbox {occ}(i, a):= \left| \left\{ s_k \in \Omega \mid s_k[i] = a \right\} \right| . \end{aligned}$$
(10)

Furthermore, let \(\hbox {occ}_{\min }(i)\) be defined as the minimal number of occurrences of any letter at position i of all input strings, that is:

$$\begin{aligned} \hbox {occ}_{\min }(i):= \min \left\{ \hbox {occ}(i, a) \mid a \in \Sigma \right\} . \end{aligned}$$
(11)

Then, j (the starting position for solution construction) is determined in function DetermineStartingPosition() of line 4 as follows:

$$\begin{aligned} j:= \mathop {\mathrm {arg\,min}}\limits _{i \in \{1, \ldots , m\}} \; \hbox {occ}_{\min }(i). \end{aligned}$$
(12)

In the case of ties, we take—among all tied options—the one with the lowest index. Starting from j, the randomized heuristic from Algorithm 2 then assigns a letter to all positions in sequential order. When the end of the solution string is reached, j is set to one; see lines 19 and 20. In this context, note that a solution S is assembled as a set of solution components. However, assigning a solution component \(c_{j,a}\) to S is exactly the same as assigning the letter a to position j of the solution S in string form. The choice of a letter for the current position j is made as follows. First, a random number r is drawn uniformly at random from [0, 1]. In case \(r > d_{\textrm{rate}}\)—where \(d_{\textrm{rate}}\) is an algorithm parameter called the determinism rate—a letter from \(\Sigma \) is chosen uniformly at random and assigned to j. Otherwise, it is first determined whether a letter \(a \in \Sigma \) exists such that the objective function value of the extended partial solution \(S \cup \{c_{j,a}\}\)—that is, \(f_{\textrm{partial}}(S \cup \{c_{j,a}\})\)—is better than the objective function value of the partial solution S; see line 9. Hereby, function \(f_{\textrm{partial}}()\) is exactly the same as the original objective function \(f_{\textrm{orig}}()\), except that it only considers those positions of the partial solution that are already occupied with a letter. If such a letter \(a \in \Sigma \) exists, it is deterministically chosen and assigned to position j; see line 10. Otherwise, we deterministically choose the letter \(b \in \Sigma \) such that \(\hbox {occ}(j,b) \le \hbox {occ}(j,z)\) for all \(z \in \Sigma \) and assign it to position j of S; see line 13. The solution construction procedure finishes once all positions of S have an assigned letter.

Algorithm 2: Randomized heuristic for the FFMSP
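A condensed C++ sketch of this construction procedure is given below. It is our own illustrative code, not the authors' implementation: ties and the choice among several improving letters are resolved by taking the first candidate found, and \(f_{\textrm{partial}}\) is realized via per-string mismatch counters:

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// occ(i, a) of Eq. (10): occurrences of letter a at position i of all inputs.
static int occ(const std::vector<std::string>& omega, int i, char a) {
    int c = 0;
    for (const auto& s : omega) if (s[i] == a) ++c;
    return c;
}

std::string constructSolution(const std::vector<std::string>& omega,
                              const std::string& sigma, int t,
                              double dRate, std::mt19937& rng) {
    const int m = omega[0].size(), n = omega.size();
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> randLetter(0, sigma.size() - 1);

    // Eqs. (11)-(12): start at the position whose rarest letter is rarest
    // overall; ties are broken in favor of the lowest index.
    int start = 0, best = n + 1;
    for (int i = 0; i < m; ++i) {
        int occMin = n;
        for (char a : sigma) occMin = std::min(occMin, occ(omega, i, a));
        if (occMin < best) { best = occMin; start = i; }
    }

    std::vector<int> mism(n, 0);              // mismatches on filled positions
    auto fPartialWith = [&](int j, char a) {  // f_partial of S extended by c_{j,a}
        int cnt = 0;
        for (int i = 0; i < n; ++i)
            if (mism[i] + (omega[i][j] != a) >= t) ++cnt;
        return cnt;
    };

    std::string s(m, '?');
    for (int step = 0, j = start; step < m; ++step, j = (j + 1) % m) {
        char chosen;
        if (coin(rng) > dRate) {              // random letter
            chosen = sigma[randLetter(rng)];
        } else {
            int fNow = 0;                     // current f_partial(S)
            for (int i = 0; i < n; ++i) if (mism[i] >= t) ++fNow;
            chosen = 0;
            for (char a : sigma)              // an improving letter, if any
                if (fPartialWith(j, a) > fNow) { chosen = a; break; }
            if (chosen == 0) {                // else: rarest letter at j
                chosen = sigma[0];
                for (char z : sigma)
                    if (occ(omega, j, z) < occ(omega, j, chosen)) chosen = z;
            }
        }
        s[j] = chosen;
        for (int i = 0; i < n; ++i) if (omega[i][j] != chosen) ++mism[i];
    }
    return s;
}
```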

3.3 The Population-Based Metaheuristic for the FFMSP

With the aim of introducing learning into the solution construction mechanism of Cmsa, in this paper, we present a hybridization of Cmsa with a specific type of bacterial algorithm (BA) inspired by mechanisms that bacteria have developed to fight antibiotics. This type of algorithm was first described in [12, 22]. In this work, we present an adaptation of this algorithm for solving the FFMSP, before describing its hybridization with Cmsa. Before we delve into the algorithm description, the interested reader should note, however, that we believe that nearly any population-based metaheuristic could be used for the same purpose. Moreover, being aware of the abuse that natural phenomena have suffered in recent years in the context of so-called new nature-inspired optimization algorithms, the BA algorithm will be described in metaphor-free language.

Bacteria, classified as microorganisms, are tiny life forms similar to viruses, algae, fungi, and protozoa. These unicellular organisms, existing at a microscopic scale, have the ability to thrive in diverse environments such as oceans, land, space, and even the human intestine. The interaction between humans and bacteria is intricate; at times, bacterial behavior proves beneficial or even essential to human well-being, while at other times, it can lead to harmful diseases and health issues. Since penicillin was discovered by Alexander Fleming in 1928, antibiotics have been important in the treatment of diseases caused by bacteria and other microorganisms. However, a serious problem is that, when frequently exposed to the same type of antibiotics, bacteria develop defense mechanisms to neutralize the action of antibiotics. This key mechanism for bacterial survival is achieved through communication with other members of the population and can be understood as a collaborative mechanism based on transferring DNA among bacteria. In this way, stronger bacteria may transfer their characteristics to weaker bacteria, which can thereby acquire the capability to resist the common enemy: the antibiotic.

A relevant difference between higher organisms and bacteria is the mechanism for reproduction and recombination of genetic material. In fact, populations of higher organisms vary genetically in a vertical process, that is, offspring is created as part of a new generation through sexual interaction between parents. In bacterial populations, on the other hand, genetic variation may happen through a horizontal process in which genetic material is transferred among individuals, without requiring the creation of a new individual. Therefore, in the context of bacteria, it seems more natural to talk about donors and receptors instead of parents and offspring. In fact, reproduction in bacteria is achieved by means of cell division (replication), which produces a new bacterial generation containing exactly the same genetic material. Occasionally, this process may be affected by a mistake in the replication process or by the influence of an external agent, a mutagen.

Algorithm 3: Bacterial algorithm (BA) for the FFMSP

As previously indicated, the emergence of antibiotic resistance poses a significant concern for humans. Nevertheless, for bacteria, it signifies an evolutionary advantage that enhances their ability to survive. Viewed in this light, this particular bacterial behavior serves as a valuable source of inspiration for optimization processes, as demonstrated in [12, 22]. This is especially true for the horizontal transfer of DNA material, referring to the sharing of genetic material within the local community, i.e., among neighbors belonging to the same generation. In the following, we show how these principles were applied to solving the FFMSP.

The pseudo-code of the bacterial algorithm (BA) is provided in Algorithm 3. Apart from a problem instance \((\Omega , t)\), the algorithm requires the following five parameters as input (see lines 1 and 2):

  1. Population size (\(p_\textrm{size}\))

  2. Rate of initial solutions generated by the probabilistic heuristic (\(pr_{\textrm{heur}}\))

  3. Determinism rate used by the probabilistic heuristic (\(d_{\textrm{rate}}\))

  4. Probability of mutation during the conjugation phase (\(pr_{\textrm{mut}}\))

  5. Probability of mutation during the regeneration phase (\(pr_{\textrm{reg}}\))

At the start of the algorithm, the best-so-far solution (\(S_{\textrm{bsf}}\)) is set to the empty set. Then, the initial population of solutions of size \(p_\textrm{size}\) (where each solution corresponds to a bacterium) is generated in function GenerateInitialPopulation\((p_\textrm{size}, pr_{\textrm{heur}}, d_{\textrm{rate}})\). Hereby, with probability \(pr_{\textrm{heur}}\), a solution is generated by means of the randomized heuristic from Algorithm 2. Otherwise, the solution is generated uniformly at random. The first action of each iteration consists of the determination of the iteration-best solution \(S_{\textrm{ib}}\) from the current population (see line 6) in order to update the best-so-far solution if necessary (line 7). Then, the two main procedures of the BA—conjugation and regeneration—are executed. Both procedures start in the same way (see lines 8 and 9, respectively lines 12 and 13). In particular, the current population P is divided into two parts: donor solutions (\(P_\textrm{donor}\)) and receptor solutions (\(P_\textrm{receptor}\)). For this purpose, first, a separator level (\(S_{\textrm{level}}\)) is determined as follows in function DetermineSeparatorLevel(P). Two pairs of solutions—say, \((S_i, S_j)\) and \((S_k, S_l)\)—are chosen uniformly at random from the current population P. Then, the best solution from each pair is selected. Let \(S_1:= \hbox {argmax}\{f(S_i), f(S_j)\}\) and \(S_2:= \hbox {argmax}\{f(S_k), f(S_l)\}\). Furthermore, let \(S_{\min }\) be the lower-quality solution among \(S_1\) and \(S_2\), that is, \(S_{\min }:= \hbox {argmin}\{f(S_1), f(S_2)\}\). Then \(S_{\textrm{level}}\) is defined as \(f(S_{\min })\). Thus, a low-cost procedure is used to choose a solution with a fitness value close to the population's median. \(S_{\textrm{level}}\) is then used to divide the current population into a sub-population \(P_\textrm{donor}\) of donor solutions and a sub-population \(P_\textrm{receptor}\) of receptor solutions. Hereby, the solutions in \(P_\textrm{receptor}\) have an objective function value below \(S_{\textrm{level}}\), while donor solutions have an objective function value of at least \(S_{\textrm{level}}\).
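The following sketch (our own code) illustrates the separator-level selection and the resulting donor/receptor split; here, f() stands for whichever (possibly augmented) objective function is configured and is assumed to return a scalar:

```cpp
#include <algorithm>
#include <random>
#include <vector>

template <typename Sol, typename F>
double determineSeparatorLevel(const std::vector<Sol>& P, F f,
                               std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, P.size() - 1);
    auto pairBest = [&] {              // best of a uniformly random pair
        return std::max(f(P[pick(rng)]), f(P[pick(rng)]));
    };
    double s1 = pairBest(), s2 = pairBest();
    return std::min(s1, s2);           // S_level := f(S_min)
}

template <typename Sol, typename F>
void splitPopulation(const std::vector<Sol>& P, F f, double sLevel,
                     std::vector<Sol>& donors, std::vector<Sol>& receptors) {
    for (const auto& S : P)            // donors: f(S) >= S_level
        (f(S) >= sLevel ? donors : receptors).push_back(S);
}
```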

In the conjugation step of BA, all receptor solutions from \(P_\textrm{receptor}\) receive a random piece of genetic material from some randomly chosen donor solution from \(P_\textrm{donor}\). This piece of genetic material corresponds to a contiguous section of arbitrary size extracted from the donor solution. As in nature, this operation may suffer a corruption in the genetic transcription (mutation). Thus, each letter of the transferred sub-string may be changed to any other letter with a probability \(pr_{\textrm{mut}}\).

In contrast, in the regeneration step of BA, after classifying the members of the current population into donors \(P_\textrm{donor}\) and receptors \(P_\textrm{receptor}\), each solution from \(P_\textrm{receptor}\) is replaced by a clone of a randomly chosen donor solution, after mutation has been applied to each position of the clone with probability \(pr_{\textrm{reg}}\).
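Both variation operators are easy to state on string-form solutions. The sketch below is our own illustration of the two operators as described above:

```cpp
#include <random>
#include <string>
#include <utility>

// Conjugation: copy a random contiguous donor segment into the receptor,
// where each transferred letter may be corrupted with probability prMut.
void conjugate(std::string& receptor, const std::string& donor,
               const std::string& sigma, double prMut, std::mt19937& rng) {
    std::uniform_int_distribution<int> pos(0, donor.size() - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> letter(0, sigma.size() - 1);
    int i = pos(rng), j = pos(rng);
    if (i > j) std::swap(i, j);        // segment [i, j] of the donor
    for (int k = i; k <= j; ++k)
        receptor[k] = (coin(rng) < prMut) ? sigma[letter(rng)] : donor[k];
}

// Regeneration: clone a donor and mutate each position with probability prReg.
std::string regenerate(const std::string& donor, const std::string& sigma,
                       double prReg, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<int> letter(0, sigma.size() - 1);
    std::string clone = donor;
    for (char& c : clone)
        if (coin(rng) < prReg) c = sigma[letter(rng)];
    return clone;
}
```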

These steps are iterated until a computation time limit is reached. After termination, the best-so-far solution \(S_{\textrm{bsf}}\) is provided as output.

3.4 The \(\textsc {Learn}\_\textsc {Cmsa}\) Algorithm

Algorithm 4: \(\textsc {Learn}\_\textsc {Cmsa}\) for the FFMSP

The pseudo-code of the hybridization between Cmsa and BA is provided in Algorithm 4. In addition to the Cmsa and BA parameters, as outlined before, it takes the following two parameters, which regulate the interplay between Cmsa and BA, as input:

  1. \(b_{\textrm{iter}}\): the number of BA iterations executed in function Execute_BA_Algorithm(P, \(b_{\textrm{iter}}\), \(p_\textrm{size}\), \(pr_{\textrm{heur}}\), \(d_{\textrm{rate}}\), \(pr_{\textrm{mut}}\), \(pr_{\textrm{reg}}\)) at each Cmsa iteration; see line 10.

  2. \(r_{\textrm{inject}}\): the rate of injection of the solution returned by the ILP solver (\(S_{\textrm{bsf}}\)) into the current BA population in function InjectSolverSolution(P, \(S_{\textrm{bsf}}\), \(r_{\textrm{inject}}\)); see line 21.

The differences with respect to the pure Cmsa algorithm from Sect. 3.1 are as follows. First, in addition to the initialization of Cmsa in lines 5–7, the initial population P of the BA algorithm is generated in line 8 in the same way as explained in the context of the BA algorithm in Sect. 3.3. Next, at the beginning of each Cmsa iteration, function Execute_BA_Algorithm(P, \(b_{\textrm{iter}}\), \(p_\textrm{size}\), \(pr_{\textrm{heur}}\), \(d_{\textrm{rate}}\), \(pr_{\textrm{mut}}\), \(pr_{\textrm{reg}}\)) executes \(b_{\textrm{iter}}\) iterations of the BA algorithm in exactly the same way as explained in Sect. 3.3. This function returns the current BA population P as output. Then, function \(\textsf {Extract\_From}(P, n_{\textrm{a}})\) (see line 11) extracts exactly \(n_{\textrm{a}}\) solutions from the current BA population P and stores them in set T. In particular, T contains the best solution from P, in addition to \(n_{\textrm{a}}- 1\) randomly selected donor solutions from P. Note that, for this purpose, the separator level (\(S_{\textrm{level}}\)) is determined and the population P is divided into donors and receptors, as outlined in Sect. 3.3. The solutions from T are then added to the current sub-instance \(C'\) of Cmsa, replacing the randomized solution construction procedure of standard Cmsa. Finally, at the end of each \(\textsc {Learn}\_\textsc {Cmsa}\) iteration, the solution \(S_{\textrm{bsf}}\) returned by the solver after being applied to the current sub-instance \(C'\) is used to replace \(\lfloor r_{\textrm{inject}}\cdot |P_\textrm{receptor}| \rfloor \) receptor solutions from P. This is done in function InjectSolverSolution(P, \(S_{\textrm{bsf}}\), \(r_{\textrm{inject}}\)); see line 21.
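The two interface functions between Cmsa and BA can be sketched as follows (our own code; it assumes string-form solutions, \(n_{\textrm{a}} \ge 1\), and a non-empty donor set):

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Extract_From(P, n_a): the best solution plus (n_a - 1) random donors.
template <typename F>
std::vector<std::string> extractFrom(const std::vector<std::string>& P,
                                     int nA, F f, double sLevel,
                                     std::mt19937& rng) {
    std::vector<std::string> T, donors;
    T.push_back(*std::max_element(P.begin(), P.end(),
        [&](const auto& a, const auto& b) { return f(a) < f(b); }));
    for (const auto& S : P) if (f(S) >= sLevel) donors.push_back(S);
    std::uniform_int_distribution<std::size_t> pick(0, donors.size() - 1);
    for (int k = 1; k < nA; ++k) T.push_back(donors[pick(rng)]);
    return T;
}

// InjectSolverSolution(P, S_bsf, r_inject): overwrite floor(r_inject *
// |P_receptor|) receptor solutions with the solver solution.
template <typename F>
void injectSolverSolution(std::vector<std::string>& P,
                          const std::string& sBsf, double rInject,
                          F f, double sLevel) {
    std::vector<std::size_t> receptors;
    for (std::size_t i = 0; i < P.size(); ++i)
        if (f(P[i]) < sLevel) receptors.push_back(i);
    std::size_t k = rInject * receptors.size();  // truncation acts as floor
    for (std::size_t r = 0; r < k; ++r) P[receptors[r]] = sBsf;
}
```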

Note that in this way of hybridizing Cmsa and BA, both memory mechanisms—the sub-instance \(C'\) of Cmsa and the population P of BA—influence each other. In particular, a set of donor solutions from P is added to \(C'\) at each iteration, while Cmsa influences BA by injecting \(S_{\textrm{bsf}}\) into the BA population P.

4 Experimental Evaluation

Cmsa, BA, and \(\textsc {Learn}\_\textsc {Cmsa}\) were implemented in C++ and compiled with GCC 11.3.0. All experiments were performed in single-threaded mode on a cluster (Luthier) of the Engineering Faculty of the University of Concepción, Chile. Luthier is composed of 30 computing nodes (servers), each equipped with an Intel® Xeon® E3-1270 v6 CPU at 3.8 GHz and 64 GB of RAM. Moreover, all ILP models were solved with IBM ILOG CPLEX version 20.1.

Fig. 1: Study of the evolution of the optimality gaps produced by CPLEX for a range of different threshold values (x-axis)

In this section, we first provide an overview of the benchmark datasets used. Next, we detail the parameter tuning procedure employed to optimize the configuration of BA, Cmsa and \(\textsc {Learn}\_\textsc {Cmsa}\). Finally, we present the numerical results.

4.1 Benchmark Sets

Various benchmark instance sets for the FFMSP have been introduced by different authors; for this study, we adopt the following sets. [19] employed a collection comprising 100 problem instances with randomly generated input strings over alphabet \(\Sigma = \{\texttt {A}, \texttt {C}, \texttt {T}, \texttt {G}\}\) for each combination of \(n \in \{100, 200, 300, 400\}\) and \(m \in \{200, 600, 800\}\). This set, totaling 1200 problem instances, is referred to as Ferone from here on. It is worth noting that all instances were previously solved in other studies with thresholds \(t \in \{0.75m, 0.8m, 0.85m\}\). A subset of these instances—specifically those with \(n \in \{100, 200\}\)—was already utilized in earlier publications [10, 18].

A set of problem instances with specifications akin to those introduced by [19] was presented in [20]. However, in contrast to the 100 random instances per combination of n and m in the former, the set by [20] comprises only five random instances per combination. Furthermore, this set is limited to \(n \in \{100, 200\}\). The collection of random instances from [20] is hereinafter referred to as Gallardo.

It is important to highlight that in this study, we address all the outlined problem instances for \(t \in \{0.8m, 0.85m\}\). The threshold \(t=0.75m\) was excluded from consideration due to our observation (in line with findings in prior works) that the problem instances resulting from this threshold are easily solved to optimality. Also, we restrict the comparison to the range \(n \in \{100, 200\}\) in order to enable a complete comparison with all state-of-the-art algorithms.

4.1.1 New Datasets

We further investigate the performance of our algorithms on problem instances with different alphabet sizes, specifically focusing on sizes 12 and 20. Suitable threshold values were determined based on an analysis of the optimality gaps produced by CPLEX after a computation time of 600 s per instance. This served as an indicator of the problem's computational complexity and ensured that the chosen thresholds were high enough to produce challenging problem instances.

Threshold determination was performed for \(|\Sigma | \in \{12, 20\}\), using threshold factors spanning from 0.80 to 1.0 in increments of 0.01. Test instances were defined by combinations of \(n \in \{100, 200, 300, 400\}\) and \(m \in \{300, 600, 800\}\). For each parameter combination \((|\Sigma |, n, m)\), we randomly selected one instance and solved it using CPLEX for every specified threshold. The outcomes of these experiments (in terms of the optimality gaps) can be found in Fig. 1. Note that, in the legends, the first integer after the keyword 'cplex' indicates the value of n and the second one the value of m.

Based on the observed behavior, we have established threshold values for the 12-character alphabet at \(t \in \{0.95m, 0.97m, 1.0m\}\). For the 20-character alphabet, the thresholds are set at \(t \in \{0.98m, 0.99m, 1.0m\}\). We chose a higher starting value for the 20-character set because the figures clearly show that the larger alphabet presents a simpler problem for CPLEX. Finally, the new dataset is composed of 100 problem instances for each combination of \(|\Sigma | \in \{12, 20\}\), \(n \in \{100, 200, 300, 400\}\), and \(m \in \{300, 600, 800\}\), which makes a total of 2400 problem instances.

4.2 Algorithm Tuning

The adjustment of the parameter values of BA, Cmsa, and \(\textsc {Learn}\_\textsc {Cmsa}\) was performed with the tuning tool irace [23]. During preliminary experiments, we noticed that the parameter value requirements of the tested algorithms change with the threshold value. Another important characteristic of a problem instance is n, the number of input strings. Therefore, we decided to tune parameters separately for small problem instances (\(n \in \{100, 200\}\)) and large problem instances (\(n \in \{300, 400\}\)). In this way, we performed 16 different tuning processes for each algorithm, one for each combination of \(|\Sigma |\), n (small vs. large), and t. As tuning instances, we use the first instance (out of 100) for each case from the Ferone dataset (for \(|\Sigma |=4\)), respectively from our new datasets (for \(|\Sigma | \in \{12, 20\}\)).

The algorithm parameters and the corresponding domains are detailed in Tables 1a, b, and 2. Additionally, irace was allocated a budget of 5000 runs, each run with a time limit of 600 s. The parameter configurations identified by irace are presented in the respective tables.

Table 1 Parameters of Cmsa (a) and BA (b) together with their domains considered for the tuning process and chosen values for all datasets
Table 2 Algorithm parameters for \(\textsc {Learn}\_\textsc {Cmsa}\) and their domains considered for the tuning process with irace for all datasets

Analysis of the \(\textsc {Learn}\_\textsc {Cmsa}\) tuning results. In Table 2, we can observe the tuning results obtained with irace for \(\textsc {Learn}\_\textsc {Cmsa}\). Considering the case with \(|\Sigma |=4\), there is a significant difference in behavior between \(t=0.8m\) and \(t=0.85m\). At \(t=0.8m\), a collaborative interaction between BA and CPLEX can be noticed. Such an interaction is characterized by the allocation of shorter computation times to the ILP solver (\(t_{\textrm{solver}}\)), a lower maximum age for solution components (\(\text {age}_{\textrm{max}}\)), and fewer solutions extracted from BA to feed the sub-instance of \(\textsc {Learn}\_\textsc {Cmsa}\) (\(n_{\textrm{a}}\)). This limits the introduction of new solution components per iteration, thereby maintaining a more controlled sub-instance size. Furthermore, BA seems to play an important role, as it works with higher-quality initial solutions (see the high value for heuristic initialization \(pr_{\textrm{heur}}\)) and with high levels of determinism (\(d_{\textrm{rate}}\)). This is not the case with \(t=0.85m\), where the process seems to be more of a pipeline between BA and CPLEX, evident from the high computation time allocated to CPLEX and the larger number of solutions (\(n_{\textrm{a}}\)) extracted for inclusion in the sub-instance. This results in larger and more complex sub-instances that demand longer execution times from the solver.

Concerning \(|\Sigma |=12\), the analysis is less clear-cut. In some cases where the CPLEX time is limited ((\(t=0.95m\), small), (\(t=0.97m\), small), and (\(t=1.0m\), large)), the allocated number of solutions extracted from BA (\(n_{\textrm{a}}\)) does not seem consistent, resulting in potentially large sub-instances (\(t=0.95m\), small). Also, these cases are not accompanied by a high determinism for BA (see the values for \(d_{\textrm{rate}}\) and \(pr_{\textrm{heur}}\)), unlike what was observed for \(|\Sigma |=4\). This could be attributed to the fact that, with a growing alphabet size, the problem seems to become easier, allowing irace more flexibility in the parameter value selection. In the pipeline scenario, the parameters appear clearer, especially when solver times are high ((\(t=0.97m\), small), (\(t=0.97m\), large), (\(t=1.0m\), large)), resorting to extracting the maximum number of solutions from BA (\(n_{\textrm{a}}\)) at each iteration.

Table 3 Numerical results concerning instances from the Ferone set

Finally, concerning \(|\Sigma |=20\), it also appears that a pipeline approach is preferred with extended solver times (\(t_{\textrm{solver}}\)) and the maximum number of solutions extracted from BA (\(n_{\textrm{a}}\)), particularly in small cases, and for (\(t=0.98m\), large) and (\(t=0.99m\), large). An exception is observed for (\(t=1.0m\), large), which features a more limited computation time for the solver, even though it deals with a potentially larger sub-instance (maximum value for \(n_{\textrm{a}}\)).

Table 4 Numerical results for the instances from the Gallardo set

Upon examining the tuning results concerning the chosen objective functions, it is evident that there is a tendency to utilize the enhanced functions \(o_f=1\) (\(f_{\textrm{blu}}\)) and \(o_f=2\) (\(f_{\textrm{sim}}\)) in over \(80\%\) of the configurations. This aligns with the previous assertion that these functions enhance the performance of the search process compared to the original objective function (\(f_{\textrm{orig}}\)). Hereby, note that \(f_{\textrm{blu}}\) is the most frequently used objective function (\(56\%\)). Another interesting observation is that in cases where the CPLEX time is shorter, leading to greater collaboration between BA and CPLEX within the framework, the rate at which CPLEX solutions are injected into the BA population (\(r_{\textrm{inject}}\)) is lower than in cases operating in a pipeline manner. This results in a reduced influence of CPLEX solutions on the bacteria population.

Analysis of the Cmsa tuning results. In Table 1a, the parameter values selected by irace for Cmsa are presented. The prevailing configuration is marked by substantial computation time allocations to CPLEX, coupled with a considerable number of heuristic solutions generated at each iteration. These solutions exhibit low levels of determinism, causing a notable variability in the generated solution components. Consequently, this leads to the creation of large sub-instances that demand extensive computation time, in a pipeline-oriented setup of the approach. It is also noteworthy that the enhanced objective functions are more frequently utilized than the original one, albeit at a marginally lower rate than in \(\textsc {Learn}\_\textsc {Cmsa}\) (by a difference of \(0.75\%\)).

Analysis of the BA tuning results. Table 1b presents the parameter values selected by irace for BA. Notably, each configuration tends to exhibit high—sometimes even reaching the maximum—population sizes (\(p_{\textrm{size}}\)). The recommended configurations predominantly resort to heuristic initialization, showcasing medium-to-high determinism levels (\(d_{\textrm{rate}}\)). Within its operational framework, the BA is characterized by employing elevated mutation rates for its recombination operator (\(pr_{\textrm{mut}}\)), while maintaining very low mutation rates during regeneration phases (\(pr_{\textrm{reg}}\)). A salient observation is BA’s consistent preference for enhanced objective functions, eschewing the original altogether. This preference is rationalized by the potential inadequacy of the original function to differentiate between solutions. It might happen, for example, that in a population of random individuals, all have an original objective function value of zero. Again, function \(f_{\textrm{blu}}\) (\(o_f=1\)) emerges as the primary choice, being utilized in \(93\%\) of cases.

4.3 Results

The \(\textsc {Learn}\_\textsc {Cmsa}\), Cmsa and BA algorithms were applied precisely once to each of the 1,200 problem instances from the Ferone dataset (Table 3) and 10 times to each of the 60 problem instances from the Gallardo dataset (Table 4). These choices are due to the way in which existing algorithms from the literature were evaluated on these datasets. A computational time constraint of 600 CPU seconds was imposed for all executions across both datasets. The tables present the results in terms of average objective function values. For the Ferone dataset, the displayed value represents the average objective value across problem instances sharing the same (nm) combination. In contrast, for the Gallardo dataset, the value signifies the average result over the 10 repeated runs for all available instances.

In the case of the Ferone dataset, the results of \(\textsc {Learn}\_\textsc {Cmsa}\), Cmsa and BA are compared with CPLEX (standalone, with the same computation time limit as \(\textsc {Learn}\_\textsc {Cmsa}\), Cmsa and BA), a hybrid ACO algorithm called HyAco from [10], the best-performing of six GRASP variants from [19], and the current state-of-the-art algorithm called \(\textsc {Aco}_{\textrm{neg}}^+\) from [24]. The GRASP variants use a time limit of 30 CPU seconds. This might seem unfair, as \(\textsc {Learn}\_\textsc {Cmsa}\), Cmsa and BA were allowed 600 CPU seconds per run. However, the low time limit of 30 CPU seconds was chosen by the authors of [19] because GRASP did not benefit from longer running times. It is important to note that \(\textsc {Learn}\_\textsc {Cmsa}\) outperforms the competing algorithms in all problem configurations within the Ferone dataset. Not only does \(\textsc {Learn}\_\textsc {Cmsa}\) demonstrate superior performance compared to other heuristic approaches, but it also surpasses the standalone CPLEX component, the traditional implementation of Cmsa, and the BA. In other words, \(\textsc {Learn}\_\textsc {Cmsa}\) clearly seems to profit from the synergy between Cmsa and BA, its two algorithmic components.

In the case of the Gallardo dataset, again a computation time limit of 600 s per run was employed. Table 4 offers a comparison of our algorithms (\(\textsc {Learn}\_\textsc {Cmsa}\), Cmsa and BA) with the standalone application of the CPLEX component, a memetic algorithm (Ma) from [20], Ferone's GRASP proposal [18] (\(\textsc {Grasp}_{\textrm{fer}}\)), Mousavi's GRASP proposal [17] (\(\textsc {Grasp}_{\textrm{mou}}\)), and the \(\textsc {Aco}_{\textrm{neg}}^+\) approach from [24].

Table 5 Numerical results for our new dataset of instances with \(|\Sigma | \in \{12, 20\}\)

Based on the experimental results, the following observations can be made:

  • In general terms, the relative comparison between the algorithms leads to very similar conclusions as in the case of the Ferone dataset. \(\textsc {Learn}\_\textsc {Cmsa}\) is significantly better than the MA for threshold value \(t=0.80m\).

  • When \(t=0.85m\), \(\textsc {Learn}\_\textsc {Cmsa}\) is competitive with the MA, outperforming it on the larger problem instances.

Fig. 2: Critical difference (CD) plots. The four plots show the results for different (sub-)sets of problem instances

Note that the average performance of the \(\textsc {Learn}\_\textsc {Cmsa}\) algorithm is better than that of all other algorithms for the Gallardo dataset.

Finally, Table 5 presents the numerical results of our methods (\(\textsc {Learn}\_\textsc {Cmsa}\), Cmsa and BA), CPLEX, and \(\textsc {Aco}_{\textrm{neg}}^+\) for our new dataset featuring instances over alphabets of sizes 12 and 20. The table clearly shows that \(\textsc {Learn}\_\textsc {Cmsa}\) outperforms not only \(\textsc {Aco}_{\textrm{neg}}^+\)—which is, as mentioned before, one of the current state-of-the-art approaches—but also the individual algorithmic components (Cmsa and BA) of \(\textsc {Learn}\_\textsc {Cmsa}\). Nevertheless, \(\textsc {Aco}_{\textrm{neg}}^+\) exhibits its competitiveness in specific scenarios including, for example, (\(n=200\), \(t=0.95m\), \(|\Sigma |=12\)) and (\(n=300\), \(t=1.0m\), \(|\Sigma |=20\)). Nonetheless, on average, \(\textsc {Learn}\_\textsc {Cmsa}\) is superior to all considered algorithms for both alphabet sizes. Cmsa beats CPLEX by a narrow margin and falls short when compared with \(\textsc {Aco}_{\textrm{neg}}^+\) and \(\textsc {Learn}\_\textsc {Cmsa}\). Importantly, BA's performance is suboptimal when operating in a standalone manner. In summary, it can again be observed that the synergies between Cmsa and BA are exploited very well in \(\textsc {Learn}\_\textsc {Cmsa}\).

For a statistical validation of the results, we tested for performance differences among the algorithms across all datasets. For this validation, we included the results of \(\textsc {Aco}_{\textrm{neg}}^+\) (the strongest competitor of \(\textsc {Learn}\_\textsc {Cmsa}\)), standalone CPLEX, \(\textsc {Learn}\_\textsc {Cmsa}\), and the individual components of \(\textsc {Learn}\_\textsc {Cmsa}\) (Cmsa, BA). Initially, a simultaneous comparative analysis was conducted via the Friedman test. Following the rejection of the null hypothesis postulating equivalent performance across all algorithms, pair-wise assessments were undertaken through the Nemenyi post hoc test [25]. The findings are illustrated in Fig. 2 via critical difference (CD) plots. These plots position each method according to its mean ranking across the considered instance (sub)set. The CD for a significance level of 0.05 is computed, and the performances of two algorithms are deemed statistically indistinguishable if their rank difference is smaller than the CD, as depicted by the horizontal bars connecting the algorithms in the graphical representation. All tests and plots were created using the R package scmamp by Calvo and Santafé (2016), accessible at https://github.com/b0rxa/scmamp.

The CD plots substantiate \(\textsc {Learn}\_\textsc {Cmsa}\)'s dominance over the competing algorithms across all datasets, with \(\textsc {Aco}_{\textrm{neg}}^+\) trailing as a clear second. Cmsa and CPLEX are closely matched, yet Cmsa displays superiority in a comprehensive assessment over all instances (as depicted in Fig. 2a). It is important to note that the difference between Cmsa and CPLEX is greatest for small alphabets (\(|\Sigma |=4\); see Fig. 2b), shrinks considerably for \(|\Sigma |=12\) (Fig. 2c), and disappears for \(|\Sigma |=20\) (Fig. 2d). This suggests that, with an increasing alphabet size, the FFMSP seems to become easier to solve. The graphical analysis also confirms the non-competitiveness of the standalone BA variant.

5 Conclusions and Future Work

In this paper, we have proposed a hybrid variant of the Cmsa algorithm, based on a combination with a population-based metaheuristic (a bacterial algorithm, BA) whose population provides, at each iteration of Cmsa, the solutions that are merged into the current sub-instance of Cmsa. In turn, the solution provided by the ILP solver when solving the current sub-instance of Cmsa is fed back into the population of the BA. Therefore, it can be said that the proposed \(\textsc {Learn}\_\textsc {Cmsa}\) algorithm is based on two memory mechanisms: (1) the sub-instance of Cmsa and (2) the population of BA. Both algorithms—Cmsa and BA—interact mutually to improve each other's search process. The proposed algorithm is applied to the so-called far from most string problem, an NP-hard problem from the field of bioinformatics. The obtained results show, first, that \(\textsc {Learn}\_\textsc {Cmsa}\) performs significantly better than its two algorithmic components (Cmsa and BA). Second, our results show that \(\textsc {Learn}\_\textsc {Cmsa}\) is a new state-of-the-art approach for solving the far from most string problem.

However, our new \(\textsc {Learn}\_\textsc {Cmsa}\) approach certainly also has limitations. The first limitation concerns the elevated number of algorithm parameters, which results from this new algorithm being designed as a combination of two standalone approaches, each coming with its own set of parameters. A second possible limitation is that the subordinate optimization technique used within \(\textsc {Learn}\_\textsc {Cmsa}\)—BA, in the case of this paper—must work harmoniously with the outer Cmsa approach. When an optimization problem different from the far from most string problem is considered, BA might not work well for this purpose, and finding a suitable technique might not be straightforward in all cases.

In future work, we plan to apply the same mechanism to other NP-hard combinatorial optimization problems to show the universal use of the proposed technique. Moreover, we plan to replace the BA mechanism by other population-based metaheuristics to show the generality of our algorithmic proposal.