Keywords

1 Introduction

In the paper, we deal with automata, which are important for numerous practical applications, such as compiler design, bioinformatics [20], and grammatical inference [19]. These automata are finite, nondeterministic and minimal in terms of the number of states. They are given by \(A = (Q, \varSigma , \delta , q_0\), \(Q_F)\), where Q is the finite set of states of the automaton, \(\varSigma \) is the input alphabet, \(\delta : Q \times \varSigma \rightarrow 2^Q\) is the transition function, \(q_0 \in Q\) is the initial state and \(Q_F \subseteq Q\) is the set of final states [11]. The automata are also consistent with the given sample \(S = (S_+, S_-)\), in which \(S_+\) denotes the examples, and \(S_-\) denotes the counterexamples. The automaton A is said to be consistent with the sample S if for each word \(w \in S_+\) there exists a sequence of transitions between state \(q_0\) and at least one state \(q \in Q_F\) and for each word \(w \in S_-\) the condition does not hold.

In order to find a minimal nondeterministic finite automaton (NFA) consistent with the given sample S, we transform the induction problem into a constraint satisfaction problem (CSP) as discussed in [12, 15, 18]. Using the CSP formulation we ask whether for the given sample S and a given positive integer k there exists a k-state automaton consistent with S. By taking \(k = 1, 2, \ldots \), we ensure that we find the minimal NFA for the given sample.

The main contributions of the paper are as follows. Firstly, we discuss two variable ordering methods devised specifically to solve the given CSP formulation of the problem. And secondly, we perform comprehensive experiments on input samples built of amino acid sequences from WaltzDB database [2]. To assess how the sizes of sets \(S_+\) and \(S_-\) affect the performance of the induction algorithms, we consider samples for which \(|S_+| \ll |S_-|\), \(|S_+| = |S_-|\), and \(|S_+| \gg |S_-|\) hold.

The paper is organized into 6 sections. In Sect. 2 we present a brief literature review. In Sect. 3 we recall the formulation of minimal NFA induction problem as a CSP. In Sect. 4 we present the parallel induction algorithm and the two proposed variable ordering methods. We present the results of conducted experiments in Sect. 5. Finally, Sect. 6 contains the conclusions.

2 State of the Art

The problem of finding a minimal NFA consistent with the given sample is hard. In particular, it cannot be done from polynomial time and data [6]. It was also shown that inducing a minimal NFA given a deterministic finite automaton (DFA) is PSPACE-complete [16].

The existing NFA induction methods focus around state merging algorithms, designed initially for DFAs. The merging is performed over a prefix tree acceptor constructed for the sample S. The algorithms remove redundant states keeping the consistency with the sample. Among the existing algorithms there are DeLeTe2 [8], NRPNI [1], and the state merging methods proposed in [5, 9]. Some algorithms transform the induction problem to CSPs [12, 14, 15, 18].

The variable ordering methods are either static – defined before the algorithm begins, or dynamic – changing as the algorithm proceeds. Examples of static methods include deg [7] and ddeg heuristics based on the initial and current number of constraints involving the variable (degree). Dynamic methods include domain-size based method dom [10], as well as dom-deg [3], dom-ddeg [17] or dom-futdeg [4] methods, which follow the dom heuristic but in case of ties use the initial, current or future degree of the variables, respectively.

3 Problem Formulation

Since we assume the fixed number of states k, to find the NFA we only need to find the transition function \(\delta \) and the set of final states \(Q_F\). As the result, the following statements about the given CSP always hold [12, 15, 18]:

  1. 1.

    Let \(l = |\varSigma |\) be the number of symbols in the alphabet. Then we need exactly \(n = k^2l + k\) binary variables x and y.

  2. 2.

    For each pair of states \(p, q \in Q\) and each symbol \(a \in \varSigma \), if \(q \in \delta (p, a)\), then \(x_i = 1\), for some \(i \in [1, k^2l]\). Likewise, \(x_i = 0\) means that \(q \notin \delta (p, a)\) holds.

  3. 3.

    For each state \(p \in Q_F\), \(y_i = 1\) holds for some \(i \in [1, k]\). Otherwise, \(y_i = 0\).

  4. 4.

    Let ij, \(i \ne j\) be the indices of x variables. Let \(a, b \in \varSigma \), and \(p, q, r \in Q\), \(p \ne r, q \ne r\). Then the x variables are ordered as follows:

    1. (a)

      If \(a < b\) in lexicographical order, then \(i < j\) holds.

    2. (b)

      Given the transitions \(p \overset{a}{\rightarrow } r\) and \(q \overset{a}{\rightarrow } r\), if p precedes q in Q, then \(i < j\).

    3. (c)

      Given the transitions \(r \overset{a}{\rightarrow } p\) and \(r \overset{a}{\rightarrow } q\), if p precedes q in Q, then \(i < j\).

  5. 5.

    If p precedes q in Q, then \(i < j\), where ij are the y variables indices.

Example 1

Let the set of states be \(Q = \{q_0, q_1\}\), and so \(k = 2\), and let \(\varSigma = \{a, b\}\). Then variables \(y_1\) and \(y_2\) correspond to the possible final states \(q_0\) and \(q_1\). Since \(a < b\) lexicographically, and \(q_0 < q_1\) in set Q, then the following relations hold: variable \(x_1\) corresponds to the transition \(q_0 \overset{a}{\rightarrow } q_0\), variable \(x_2\) corresponds to the transition \(q_0 \overset{a}{\rightarrow } q_1\), \(\ldots \), variable \(x_8\) corresponds to the transition \(q_1 \overset{b}{\rightarrow } q_1\).

Given the above considerations, a path is a product of x variables followed by a y variable. Using Example 1, the product \(x_1x_6x_8y_1\) corresponds to the transitions \(q_0 \overset{a}{\rightarrow } q_0\), \(q_0 \overset{b}{\rightarrow } q_1\), and \(q_1 \overset{b}{\rightarrow } q_1\), i.e., a word abb ending in state \(q_1\).

Let us note that for the given number of states k and for each word w, there are exactly \(k^{|w|}\) paths on which this word can be spelled out. There are also \(k^{|w|}\) products of length \(|w| + 1\) corresponding to these paths. In the sequel, we use \(P^w_{mi}\) to denote the m-th product of x variables corresponding to word w, and ending in variable \(y_i\). Hence an NFA is consistent with the sample S when:

  1. 1.

    for all words \(w \in S_+ \setminus \{\lambda \}\) it holds that (examples acceptance):

    $$\begin{aligned} \sum _{i = 1..k} \sum _{m = 1..k^{|w| - 1}} P^w_{mi}y_i = 1, \end{aligned}$$
    (1)
  2. 2.

    for all words \(w \in S_- \setminus \{\lambda \}\) it holds that (counterexamples rejection):

    $$\begin{aligned} \sum _{i = 1..k} \sum _{m = 1..k^{|w| - 1}} P^w_{mi}y_i = 0. \end{aligned}$$
    (2)

Note that the summation is performed according to the rules of Boolean algebra. Note that (1) is satisfied iff there exists a path for word w that ends in a final state. Note also that (2) is satisfied iff either there is no path for word w, or the path ends in a non-final state. Finally, if \(\lambda \in (S_+ \cup S_-)\) holds, then we set:

$$\begin{aligned} y_1 = {\left\{ \begin{array}{ll} 1, \text{ for } \lambda \in S_+,\\ 0, \text{ for } \lambda \in S_-. \end{array}\right. } \end{aligned}$$
(3)

The above condition allows to reduce the solution space to be searched [14].

4 Algorithms and Methods

Hereinafter, let C be the set of constraints, \(c_+ \in C\) be a constraint given by (1), and \(c_- \in C\) be a constraint given by (2). Let |c| be the number of active (non-zero) products P in (1) or (2), and d(cx) be the number of active products in c containing variable x. The variable ordering methods pertain to the x variables. The y variables are set first to produce independent instances of the CSP, solved in parallel [12, 14]. It also applies to the deg method used in the experiments.

Fig. 1.
figure 1

The induction algorithm

Fig. 2.
figure 2

The Evaluate procedure for min-max-ex variable ordering method

The induction procedure InduceNFA, executed for each independent instance of the CSP, is given in Fig. 1. The procedure consists of the evaluation phase (represented by the Evaluate procedure in line 2), and the ordering phase (represented by the Reorder procedure in line 5). They both differ depending on the selected variable ordering method.

The min-max-ex Method. The Evaluate procedure (Fig. 2) implements a ‘fail-fast’ behavior, by checking first the constraints that may result in a contradiction (lines 6 and 8). It aims at explicitly satisfying constraints \(c_+ \in C\) (line 9), setting the other \(x_i\) variables to zeros if all \(c_+\) are satisfied (line 10). It updates \(|c_+|\) and \(d(c_+, x_j)\) using the techniques described in [12] (lines 3–5).

The Reorder procedure (Fig. 3) selects a constraint \(c_+' \in C\), for which \(|c_+'|\) is minimal (line 2) and the index of the most frequent variable \(x_i \in c'_+\) (line 3). It sets v to zero (line 4). The ordering makes the evaluation shorter, by choosing constraint with fewer products and enforcing the ‘fail-fast’ behavior.

Fig. 3.
figure 3

The Reorder procedure for min-max-ex variable ordering method

The min-max-cex Method. The second variable ordering method, called min-max-cex, differs from min-max-ex in that, that in the Evaluate procedure, for \(v = 1\) it only checks for a contradiction in any \(c_- \in C\). For \(v = 0\) it checks for a contradiction in any \(c_+ \in C\), it updates the \(|c_-|\) and \(d(c_-, x_j)\) values as in lines 3–5, and checks for satisfied constraints \(c_-\). If all \(c_-\) are satisfied it sets all unset \(x_j\) variables to 1. In the Reorder procedure, in line 2 we look for a constraint \(c_- \in C\), such that \(|c_-|\) is minimal, and we propose that \(v \leftarrow 1\) in line 4.

The deg Method. In the Evaluate procedure the deg method first checks for any contradictions, either in \(c_+\) or in \(c_-\) (for \(v = 0\) or \(v = 1\), respectively), and then it checks for satisfied constraints. We require that all constraints are explicitly satisfied. Since the method is static, the order of variables is established before the induction begins. So the Reorder procedure merely returns the next variable according to the already known order. It sets \(v \leftarrow 0\) in line 4 of Fig. 3.

Example 2

Due to space limitations, we provide an example trace of the induction algorithm at https://github.com/tjastrzab/iccs.

5 Experimental Evaluation

We generated 450 samples based on the sets of amino acid sequences [2]. The number of sequences in set \(S_+\) (resp. \(S_-\)) was 5 (resp. 45), 25 (resp. 25), and 45 (resp. 5), for the samples, for which \(|S_+| \ll |S_-|\), \(|S_+| = |S_-|\), and \(|S_+| \gg |S_-|\) hold. The algorithms were implemented in Java and run on an Intel Xeon E5-2640 2.60 GHz processor (16 cores, 8 GB RAM). The time limit was 3 hours. We considered two scenarios. In Exp. 1, we sought the first consistent k-state automaton. In Exp. 2, the configuration of the final states was also given to see how the algorithms perform when searching for a specific NFA. All minimal automata had 2 states.

The distribution of the run times, given by the minimum, maximum and average values, is shown in Table 1. Note that in all cases, the average times for the best- and worst-performing methods, differ by an order of magnitude or more. Moreover, the balanced samples are the hardest to solve, since the average times are the longest for both Exp. 1 and Exp. 2. The min-max-ex method prevails when \(|S_+| \ll |S_-|\) holds, while min-max-cex is the fastest for the cases in which \(|S_+| \gg |S_-|\) is true. This is expected since the time spent in the Reorder procedure is then much shorter (as the ordering is based on \(S_+\) or \(S_-\) only). The min-max-cex method is also the fastest for the hardest sample type. This proves that the new methods are competitive with respect to the deg method.

Table 1. Minimum, maximum and average run times (in seconds)

To explain the differences in the run times, we counted the number of calls of the InduceNFA procedure, i.e., the number of visited solution tree nodes. We observed that the number of calls differed by two to four orders of magnitude.

Finally, to observe the relation between the run time performance and the automaton size, we counted the number of transitions in each NFA. We observed that min-max-ex and deg methods produce NFAs of similar size, while the min-max-cex method generates on average 17–30 transitions more. It is so because min-max-ex and deg methods produce mostly the transitions necessary to accept the examples. The min-max-cex method satisfies all counterexamples and makes the remaining variables equal to one, which generates more transitions.

6 Conclusions

In the paper we have investigated the problem of nondeterministic finite automata induction. The experiments have shown that the proposed variable ordering methods perform better than the state-of-the-art one, especially in case of imbalanced sizes of sets \(S_+\) and \(S_-\). The result is important since it is not uncommon that, for instance, we know just a few factors causing a disease (set \(S_+\)) and much more factors that are not responsible for this particular disease (set \(S_-\)). Hence, being able to classify these factors efficiently and correctly using the induced NFAs, can be of help in some bioinformatics tasks.