1 Introduction

A powerful and well-known tool for representing patterns and objects is the graph-based representation. Graphs depict the components of a pattern by means of vertices, and the relational properties between them using edges. Moreover, through the attributes (labels) assigned to vertices and edges, a graph can carry additional information and characteristics about the pattern. Finding the (dis)similarities between two graphs requires computing and evaluating the best matching between them. Since exact isomorphism rarely occurs in pattern analysis applications, the matching process must be error-tolerant, i.e., it must tolerate differences in the topology and/or the labeling. Graphs can then be used in tasks such as classification and clustering, by studying and comparing them. The error-tolerant graph matching problem is known to be difficult, mainly due to its computational complexity, especially for large graphs. Graphs have become popular in the context of Structure-Activity Relationships (SAR), since they provide a natural representation of the atom-bond structure of molecules: each vertex of the graph represents an atom, while an edge represents a molecular bond [9]. SAR is an important application field for graph matching.

The Graph Edit Distance (GED) problem is an error-tolerant graph matching problem. It provides a dissimilarity measure between two graphs, by computing the cost of the edit operations needed to transform one graph into the other. The edit operations are substitution, insertion and deletion, and can be applied to both vertices and edges. Solving the GED problem consists of finding the set of edit operations that minimizes the total cost. GED is flexible by design: it has been shown that changing the edit cost properties allows solving other matching problems, such as maximum common subgraph, graph isomorphism and subgraph isomorphism [3]. The GED problem also has many applications, including image analysis, handwritten document analysis, malware detection, bio- and chemoinformatics, among others [12]. GED is an NP-hard problem [13], so the optimal solution cannot be obtained in polynomial time. Nevertheless, many heuristics have been proposed to compute good solutions in a reasonable amount of time. The works in [10, 11] presented fast algorithms that mainly solve the linear sum assignment problem for vertices, and then deduce the edge assignment. The vertex cost matrix includes information about the edges, through an estimate of the edge assignment cost implied by assigning two vertices. However, one drawback of this approach is that it takes into account only local structures, rather than the global one. Other algorithms based on beam search are presented in [4, 8]. The first builds the search tree for all vertex assignments, then processes only the beam-size best nodes. The second solves the vertex assignment problem first, and then, using beam search, tries to improve the initial assignment by swapping vertices and re-computing the total cost. On the other hand, the GED problem has also been addressed by means of mathematical programming tools. Two types of mathematical formulations can be found in the literature: linear models as in [6, 7] and quadratic models as in [2].

This work proposes the use of the Local Branching (LocBra) heuristic, which is well known in the Operations Research domain. It was originally presented in [5] as a general metaheuristic for MILP models. It makes use of a MILP solver to explore the solution space through a defined branching scheme, and involves techniques such as intensification and diversification during the exploration. To benefit from the efficiency of these techniques, a LocBra version is designed and adapted to handle the GED problem. In addition, the original diversification is modified to take into account problem-specific information, which improves the exploration and returns better solutions. Since LocBra depends on a MILP model, \( MILP^{JH} \) is chosen in the implementation of LocBra because, to the best of our knowledge, it is one of the most efficient models for the GED problem [7]. Henceforth, the heuristic is referred to as \( LocBra\_GED \). It is then evaluated and compared with existing competitive heuristic algorithms. The remainder is organized as follows: Sect. 2 presents the definition of the GED problem, followed by a review of the \(MILP^{JH}\) model. Sect. 3 details the proposed heuristic, Sect. 4 shows the results of the computational experiments, and Sect. 5 highlights some concluding remarks.

2 GED Definition and \(MILP^{JH}\) Model

An attributed graph is a 4-tuple \( G = (V,E,\mu ,\xi ) \) where, V is the set of vertices, E is the set of edges, such that \( E \subseteq V \times V \), \( \mu : V \rightarrow L_V \) (resp. \( \xi : E \rightarrow L_E \)) is the function that assigns attributes to a vertex (resp. an edge), and \( L_V \) (resp. \( L_E \)) is the label space for vertices (resp. edges).

Next, given two graphs \( G = (V,E,\mu ,\xi ) \) and \( G' = (V',E',\mu ',\xi ') \), GED is the task of transforming a source graph into a target graph. To accomplish this, GED introduces vertex and edge edit operations: \( (u \rightarrow v) \) is the substitution of two vertices, \( (u \rightarrow \epsilon ) \) is the deletion of a vertex, and \( (\epsilon \rightarrow v) \) is the insertion of a vertex, with \( u \in V, v \in V' \) and \( \epsilon \) the empty vertex. The same logic applies to edges. The set of operations that reflects a valid transformation of G into \( G' \) is called a complete edit path, defined as \( \lambda (G,G') = \{e_1,\ldots , e_k\} \) where \( e_i \) is an elementary vertex (or edge) edit operation and k is the number of operations. GED is then

$$\begin{aligned} d_{min}(G,G') = \min _{\lambda \in \varGamma (G,G')} \sum _{e_i \in \lambda } c(e_i) \end{aligned}$$
(1)

where \(\varGamma (G,G') \) is the set of all complete edit paths, \( d_{min} \) represents the minimal cost obtained by a complete edit path \( \lambda (G,G') \), and c is the cost function that assigns the costs to elementary edit operations.
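As an illustration, the cost of a complete edit path is simply the sum of the costs of its elementary operations, which Eq. 1 minimizes over all complete edit paths. A minimal sketch, where the operation encoding and the cost values are illustrative and not taken from the paper:

```python
# Illustrative sketch of the cost of a complete edit path lambda(G, G').
# Each elementary operation is a pair (source, target); None stands for
# the empty vertex epsilon. The cost function c below is a toy example.

EPS = None  # the empty vertex epsilon

def c(op):
    """Toy cost function: insertion and deletion cost 1; substitution
    costs 1 if labels differ and 0 otherwise (all values illustrative)."""
    u, v = op
    if u is EPS or v is EPS:       # insertion or deletion
        return 1.0
    return 0.0 if u == v else 1.0  # substitution

def edit_path_cost(path):
    """Total cost of a complete edit path: the sum minimized in Eq. 1."""
    return sum(c(op) for op in path)

# Example: keep a 'C' atom (free), substitute 'N' by 'O', delete an 'H'.
path = [('C', 'C'), ('N', 'O'), ('H', EPS)]
print(edit_path_cost(path))  # 2.0
```

GED then corresponds to the smallest such total over every valid complete edit path between the two graphs.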

\( MILP^{JH} \) is a model proposed in [6] that solves the GED problem. The main idea consists in determining the permutation matrix that minimizes the \(L_1\) norm of the difference between the adjacency matrix of the input graph and the permuted adjacency matrix of the target one. The details of the construction of the model can be found in [6]. The model is as follows:

$$\begin{aligned} \displaystyle \min _{P,S,T \in \{0,1\}^{N \times N}} \sum _{i = 1}^{N} \sum _{j = 1}^{N} c\left( \mu (u_i),\mu '(v_j)\right) P^{ij} + \left( \frac{1}{2} \times const \times (S + T)^{ij} \right) \end{aligned}$$
(2)

such that

$$\begin{aligned} \left( AP - PA' + S - T \right) ^{ij} = 0 \ \forall i, j \in \{1,N\} \end{aligned}$$
(3)
$$\begin{aligned} \sum _{i = 1}^N P^{ik} = \sum _{j = 1}^N P^{kj} = 1 \ \forall k \in \{1,N\} \end{aligned}$$
(4)

where A and \( A' \) are the adjacency matrices of graphs G and \( G' \) respectively, and \( c: (\mu (u_i),\mu '(v_j)) \rightarrow \mathbb {R}^+ \) is the cost function that measures the distance between two vertex attributes. As for P, S and T, they are boolean matrices of size \( N \times N \), with \( N = |V| + |V'| \). P is the permutation matrix representing the vertex matching, e.g. \( P^{ij} = 1 \) means that vertex \( i \in V \cup \{\epsilon \} \) is matched with vertex \( j \in V' \cup \{\epsilon \} \), while S and T encode the edge matching. Hence, the objective function (Eq. 2) minimizes the cost of both the vertex and the edge matching. Constraint 3 ensures that when two couples of vertices are matched, the edges between each couple are mapped accordingly. Constraint 4 guarantees the integrity of P, i.e. that it is a valid permutation matrix. This model has the limitation that it does not consider attributes on edges: edge substitution costs 0, while deletion and insertion have a cost \( const \in \mathbb {R}^+ \).
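To see how S and T absorb the edge discrepancies in constraint (3), note that it forces \( S - T = PA' - AP \), so S and T can be read as the positive and negative parts of that difference. A small numpy check on toy graphs, with an assumed (deliberately suboptimal) vertex matching P; the matrices are illustrative, not from the paper:

```python
import numpy as np

# Toy adjacency matrices of two graphs, padded to the same size N = 3.
A  = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])
A2 = np.array([[0, 1, 1],   # A' in the text
               [1, 0, 0],
               [1, 0, 0]])

# Assumed vertex matching: the identity permutation (vertex i -> vertex i).
P = np.eye(3, dtype=int)

# S and T as the positive/negative parts of PA' - AP: they record the
# edge entries that must be inserted (S) or deleted (T) under P.
D = P @ A2 - A @ P
S = np.maximum(D, 0)
T = np.maximum(-D, 0)

# Constraint (3): (AP - PA' + S - T)^{ij} = 0 for all i, j.
assert np.all(A @ P - P @ A2 + S - T == 0)

# Each undirected edge mismatch is counted twice (symmetric matrices).
print(int(S.sum() + T.sum()))  # 4
```

The objective (Eq. 2) charges \( \frac{1}{2} \cdot const \) per entry of \( S + T \), which is exactly why a matching with fewer edge mismatches is preferred.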

Fig. 1. Local branching flow. (a) Depicts the left and right branching. (b) Shows the neighborhoods in the solution space.

3 Local Branching Heuristic for GED

The general MILP formulation is of the form:

$$\begin{aligned} \min _{x} \ c^Tx \end{aligned}$$
(5)
$$\begin{aligned} Ax \ge b \end{aligned}$$
(6)
$$\begin{aligned} x_j \in \{0,1\}, \ \forall j \in B \ne \emptyset \end{aligned}$$
(7)
$$\begin{aligned} x_j \in \mathbb {N}, \forall j \in I \end{aligned}$$
(8)
$$\begin{aligned} x_j \in \mathbb {R}, \forall j \in C \end{aligned}$$
(9)

where the variable index set is split into three sets \( (B, I, C) \), which stand for binary, integer and continuous, respectively. \( MILP^{JH} \) can be written in this form, with \( I = C = \emptyset \) since all the variables are binary.

As presented in [5], the LocBra heuristic is a local search approach that makes use of a MILP solver to explore neighborhoods of solutions through a branching scheme. In addition, it involves mechanisms such as intensification and diversification. Starting from an initial solution \( x^0 \), it defines the k-opt neighborhood \( N(x^0,k) \), with k a given integer. In other words, the neighborhood set contains the solutions that are within a distance of at most k from \( x^0 \) (in the sense of the Hamming distance). This amounts to adding the following local branching constraint to the base \( MILP^{JH} \) model:

$$\begin{aligned} \varDelta (x,x^0) = \sum _{j \in S^0} (1 - x_j) + \sum _{j \in B \setminus S^0} x_j \le k \end{aligned}$$
(10)

such that B is the index set of binary variables defined in the model, and \( S^0 = \{j \in B : x^0_j = 1\} \). This new model is then solved, leading to the search for the best solution in \( N(x^0,k) \). This phase corresponds to intensifying the search in a neighborhood, e.g. node 2 in Fig. 1a. If a new solution \( x^1 \) is found, the constraint (Eq. 10) is replaced by \( \varDelta (x,x^0) \ge k + 1 \) (node 3 in Fig. 1a). This constraint ensures that visited solutions (e.g. \( x^0 \)) will not be selected twice. Next, a new constraint of the form of Eq. 10 is added, now centered on \( x^1 \), to explore its neighborhood. The process is repeated until a stopping criterion is met, e.g. a total time limit is reached. However, an improved solution is not guaranteed to be found at every step, for instance because the node time limit is reached or the subproblem becomes infeasible. Suppose that at node 6 (Fig. 1a) the model \( MILP^{JH} \) plus the constraint \( \varDelta (x,x^2) \le k \) does not yield a feasible solution within the given time limit. It may then be worthwhile to apply a complementary intensification phase, by adding the constraint \( \varDelta (x,x^2) \le k/2 \) and solving the new model. If again no feasible solution is found (e.g. node 7 of Fig. 1a), then a diversification phase is applied by adding the constraint \( \varDelta (x,x^2) \ge k\_dv \) (with \( k\_dv \) a positive integer), to jump to another point in the solution space (e.g. node 8). The diversification constraint simply forces the algorithm to change the region of the search space by choosing any solution that has at least \( k\_dv \) differences from the current solution (the authors in [5] use another diversification constraint). Figure 1b shows the evolution of the solution search and the neighborhoods.
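The left-branch constraint (Eq. 10) is just a Hamming-ball restriction on the binary variables. A small sketch of the distance \( \varDelta (x,x^0) \) and of the k-opt neighborhood membership test; the vector encoding of the solutions is illustrative:

```python
def delta(x, x0):
    """Hamming-style distance Delta(x, x0) of Eq. 10 over binary variables:
    variables at 1 in x0 that drop to 0, plus variables at 0 that rise to 1."""
    s0 = {j for j, v in enumerate(x0) if v == 1}
    return (sum(1 - x[j] for j in s0)
            + sum(x[j] for j in range(len(x)) if j not in s0))

def in_left_branch(x, x0, k):
    """True if x lies in the k-opt neighborhood N(x0, k), i.e. Delta <= k."""
    return delta(x, x0) <= k

x0 = [1, 0, 1, 0]   # current incumbent
x1 = [1, 1, 1, 0]   # one flip away from x0
x2 = [0, 1, 0, 1]   # four flips away from x0
print(delta(x1, x0), delta(x2, x0))  # 1 4
print(in_left_branch(x1, x0, k=2), in_left_branch(x2, x0, k=2))  # True False
```

The right branch \( \varDelta (x,x^0) \ge k+1 \) and the diversification \( \varDelta (x,x^i) \ge k\_dv \) are the complementary tests on the same distance.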

The key point of this heuristic is the selection of the variables used in the branching. To adapt the method to GED, \( \varDelta (x,x^i) \) is replaced by \( \varDelta '(x,x^i) \), where x contains only the binary variables that represent the vertex assignment (edge assignment variables are excluded). The reason is that the edge assignment is driven by the vertex assignment: by definition of the GED problem, deleting a vertex implies deleting all edges connected to it. A further improvement is proposed for the diversification mechanism, where again not all binary variables are included, but a smaller set of important variables is used instead. The diversification constraint is then \( \varDelta ''(x,x^i) = \sum _{j \in S_{imp}^i} (1 - x_j) + \sum _{j \in B_{imp} \setminus S_{imp}^i} x_j \ge k\_dv \), where \( B_{imp} \) is the index set of important binary variables and \( S^i_{imp} = \{j \in B_{imp} : x^i_j = 1\} \). The selection of these variables is based on the assumption that a variable is important if changing its value from \( 1 \rightarrow 0 \) (or the opposite) highly impacts the objective function value. This, in turn, helps skip local optima and change the matching. Accordingly, \( B_{imp} \) is defined by computing a special cost matrix \( [C_{ij}] \) for each possible assignment of a vertex \( i \in V \cup \{\epsilon \} \) to a vertex \( j \in V' \cup \{\epsilon \} \). Each value \( C_{ij} = c_{ij} + \theta _{ij} \), where \( c_{ij} \) is the vertex operation cost induced by assigning vertex i to vertex j, and \( \theta _{ij} \) is the cost of assigning the set of edges \( E_i = \{(i,v) \in E \} \) to \( E_j = \{(j,v') \in E'\} \). This assignment problem, of size \( max(|E_i|,|E_j|) \times max(|E_i|,|E_j|) \), is solved by the Hungarian algorithm, which requires \( O(max(|E_i|,|E_j|)^3) \) time.
Next, the standard deviation is computed for each row of the matrix \( [C_{ij}] \), resulting in a vector \( [\sigma _i] \). The values in \( [\sigma _i] \) are split into two clusters, min and max. Finally, for every \( \sigma _i \) belonging to the max cluster, the indexes of all binary variables that represent a vertex assignment for i are added to \( B_{imp} \). Consequently, the local structure of a vertex is considered to assess its influence on the objective function value. Preliminary experiments, not reported here, have shown that this diversification improves the local branching heuristic more than the original diversification defined in [5].
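The selection of \( B_{imp} \) described above can be sketched as follows: \( \theta_{ij} \) comes from a Hungarian assignment over incident-edge costs (here via scipy's `linear_sum_assignment`), the per-row standard deviations \( \sigma_i \) are computed, and the rows falling in the high cluster mark their vertices as important. The two-cluster split is simplified here to a midpoint threshold, an assumption since the paper does not specify the clustering method; the cost matrix is a toy example:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_assignment_cost(edge_costs):
    """theta_ij: optimal cost of assigning the incident-edge set E_i to E_j,
    solved with the Hungarian algorithm on a square (padded) cost matrix."""
    rows, cols = linear_sum_assignment(edge_costs)
    return edge_costs[rows, cols].sum()

def important_vertices(C):
    """Indices i whose row of the special cost matrix [C_ij] has a standard
    deviation in the 'max' cluster (here: above the min/max midpoint)."""
    sigma = C.std(axis=1)
    thresh = (sigma.min() + sigma.max()) / 2.0  # simplified 2-cluster split
    return [i for i, s in enumerate(sigma) if s > thresh]

# Toy theta computation: 2x2 incident-edge cost matrix.
print(edge_assignment_cost(np.array([[1.0, 2.0],
                                     [3.0, 0.0]])))  # 1.0

# Toy special cost matrix: row 1 is volatile, so vertex 1 is 'important'.
C = np.array([[1.0, 1.1, 0.9],
              [0.0, 5.0, 10.0],
              [2.0, 2.1, 1.9]])
print(important_vertices(C))  # [1]
```

All binary variables encoding an assignment of an important vertex would then enter \( B_{imp} \) for the modified diversification constraint \( \varDelta '' \).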

4 Computational Experiment

Database: Many graph databases representing chemical molecules exist in the literature, e.g. MUTA, PAH, MAO [1]. In the current experiment, the MUTA database is chosen because it contains different subsets of small and large graphs and is known to be difficult. There are 7 subsets, each with 10 graphs of the same size (10 to 70 vertices), plus a subset of 10 graphs with mixed sizes. Each pair of graphs is considered as an instance, for a total of 800 instances (100 per subset).

Table 1. \( LocBra\_GED \) vs. literature heuristics on MUTA instances.

Comparative heuristics and experiment settings: To solve \( MILP^{JH} \), the solver CPLEX 12.6.0 is used in mixed-integer programming mode. The \( LocBra\_GED \) algorithm is implemented in the C language. The tests were executed on a machine with the following configuration: Windows 7 (64-bit), Intel Xeon E5 with 4 cores and 8 GB RAM. \( LocBra\_GED \) is applied to all instances with the following parameter values: \( k = 20 \), \( k\_dv = 30 \), \( total\_time\_limit = 900s \), \( node\_time\_limit = 180s \). Since \( LocBra\_GED \) uses CPLEX with a time limit, it is interesting to study the performance of the solver itself with the same time limit, in order to see whether the proposed heuristic actually improves the solution of the problem; this method is called CPLEX-900s. Other competitive heuristics are included in the experiment, namely BeamSearch and SBPBeam. Both are based on the beam-search method, so their performance varies with the choice of the beam size; therefore two versions of each are considered: BeamSearch 5, BeamSearch 15000, SBPBeam 5 and SBPBeam 400. All parameter values are set based on preliminary experiments that are not shown here. For instance, beam sizes 15000 and 400 increase the execution times of BeamSearch and SBPBeam to around 900s. For each heuristic, the following values are computed for each subset of graphs: \( t_{avg} \) is the average CPU time in seconds over all instances; \( d_{min}, d_{avg}, d_{max} \) are the deviation percentages between the solutions obtained by a heuristic and the optimal solutions. Given an instance I and a heuristic H, the deviation percentage equals \( \frac{solution^H_I - optimal_I}{optimal_I} \times 100 \), where \( optimal_I \) is the optimal solution for I. Lastly, \( \eta _{I} \) is the number of solutions obtained by a heuristic that are equal to the optimal ones.
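The deviation percentage used in the table is a direct translation of the formula above; a minimal sketch with illustrative values:

```python
def deviation_pct(heuristic_solution, optimal_solution):
    """Deviation of a heuristic solution from the optimal one, in percent:
    (solution_H - optimal) / optimal * 100."""
    return (heuristic_solution - optimal_solution) / optimal_solution * 100.0

# Example: a heuristic cost of 105 against an optimal cost of 100.
print(deviation_pct(105.0, 100.0))  # 5.0
```

A value of 0% thus means the heuristic matched the optimal solution, which is what \( \eta_I \) counts across a subset.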

Results and analysis: Table 1 shows the results of the experiment. Looking at \( \eta _I \), \(LocBra\_GED\) and CPLEX-900s have the highest values among all heuristics. Moreover, \(LocBra\_GED\) has higher values than CPLEX-900s for all subsets except Mixed, with 87 against 84 solutions equal to the optimal. Considering \( d_{avg} \), again \(LocBra\_GED\) and CPLEX-900s have the smallest deviations w.r.t. the optimal solutions (always less than \( 2\% \)). For easy instances (subsets 10 to 40), both heuristics score \( 0\% \), except for subset 40 where \(LocBra\_GED\) has a \( 0.06\% \) average deviation, still very close to CPLEX-900s. For hard instances (subsets 50 to 70), \(LocBra\_GED\) has the lowest deviations (always less than \( 1\% \)). The same conclusion holds for the Mixed subset. The \( d_{avg} \) values of the beam-search based methods are very high compared to the first two heuristics; on hard instances they exceed \( 100\% \) (e.g. SBPBeam on subset 70 reaches \( 107\% \)). Regarding execution times, BeamSearch 5 is the fastest with \( t_{avg} \) always below 0.2s, but in terms of solution quality it is very far from the optimal, with \( d_{avg} \) up to \( 73\% \). Even after increasing the beam size, neither BeamSearch nor SBPBeam is able to provide better solutions, and the average deviations remain very high. BeamSearch 15000 was not applicable to subset Mixed because it did not return feasible solutions for all the instances, so the deviations and \( \eta _I \) are not computed. In summary, \( LocBra\_GED \) is the most accurate and provides near-optimal solutions.

5 Conclusion

In this work, a local branching heuristic is proposed for the GED problem. \( LocBra\_GED \) significantly improves on the literature heuristics and provides near-optimal solutions. This is mainly due to the branching scheme, combined with the efficiency of CPLEX when solving the \( MILP^{JH} \) model. A second important factor is the diversification procedure, which is problem dependent and helps escaping local optima. Next, more heuristic algorithms will be tested against \( LocBra\_GED \), and further techniques will be investigated to improve the method and to allow dealing with graphs that have attributes on their edges.