Keywords

1 Introduction

Computing the dissimilarity between two graphs is a crucial issue for graph-based pattern recognition problems, e.g. malware detection [10], chemoinformatics [5], or document analysis [2]. A large number of algorithms are proposed in the literature to compute graph dissimilarity. Among existing approaches, Graph Edit Distance (GED) has retained a lot of attention during the two last decades. Using GED, graph dissimilarity computation is directly linked to a matching process through the introduction of a set of graph edit operations (e.g. vertex insertion, vertex deletion). Each edit operation being characterized by a cost, the GED is the total cost of the least expensive sequence of edit operations that transforms one graph into the other one. A major theoretical advantage of GED is that it is a dissimilarity measure for arbitrarily structured and arbitrarily attributed graphs. Moreover, together with the dissimilarity, it provides the corresponding sequence of edit operations, which can be itself useful for application purpose. A limitation for the use of GED is its computational complexity. Indeed, as stated in [22], the GED problem is NP-hard. This explains why many recent contributions focus on the computation of GED approximations which can be applied on large graphs. Among existing approximations, some are based on the proposition of new heuristics to improve the performance of exact approaches [3, 21] whereas others propose faster but suboptimal methods which approximate the exact GED (e.g. [1, 4, 1416, 18, 20]). These latter are faster, but do not guarantee to find the optimal solution.

In this paper, we are interested in the exact computation of the GED between two graphs, for a given set of edit costs. In this context, the main family of existing approaches is based on the widely known A\(^*\) algorithm [3, 21]. This algorithm relies on the exploration of the tree of solutions. In this tree, a node corresponds to a partial edition of the input graph towards the target one. A leaf of the tree corresponds to a complete edit path which transforms one graph into the other. The exploration of the tree is led by developing the most promising ways on the basis of an estimation of GED. For each node, this estimation is the sum of the cost associated to the partial edit path and an estimation of the cost for the remaining path. The latter is given by a heuristic. Provided that the estimation of the future cost is lower than or equal to the real cost, an optimal path from the root node to a leaf node is guaranteed to be found [7]. The different A*-based methods published in the literature mainly differ in the implemented heuristics for the future cost estimation which correspond to different tradeoffs between approximation quality and their computation time [3, 21].

A second family of algorithms consists in using Binary Linear Programing (BLP) for computing the GED. Justice and Hero [9] proposed a BLP formulation of the GED problem. The proposed program searches for the permutation matrix which minimizes the cost of transforming \(G_1\) into \(G_2\), with \(G_1\) and \(G_2\) two unweighted and undirected attributed graphs. The criterion to be minimized takes into account costs for matching vertices, but the formulation does not integrate the ability to process graphs with labels on their edges.

In some previous works, we also have investigated the use of Integer Linear Programing for graph matching problems. In [11, 13], a formulation and a toolbox are proposed for substitution-tolerant subgraph isomorphism and its use for symbol spotting in technical drawings. In [12], a BLP-based minimum cost subgraph matching is described. It has been used for document analysis purpose in [6]. In this paper, we propose an extension of [12] for GED computation. The proposed framework can be used to compute the distance between richly attributed graphs (i.e. with attributes on both vertices and edges) which can not be tackled using the BLP formulation proposed in [9]. Experiments led on reference datasets show that the new BLP formulation can compute an exact GED on larger graphs than A\(^*\) based algorithm.

This paper is organized as follows: Sect. 2 presents the important definitions necessary for introducing our formulations of the GED. Then, Sect. 3 describes the proposed binary linear programming formulation. Section 4 presents the experiments and analyses the obtained results. Section 5 provides some concluding remarks.

2 Definitions

In this paper, we are interested in computing GED between attributed graphs. This section is dedicated to the introduction of the notations and definitions used in the remaining of the paper.

Definition 1

An attributed graph G is a 4-tuple \(G=(V, E, \mu , \xi )\), where :

  • V is a set of vertices,

  • E is a set of edges, such that \(\forall e=ij \in E, i \in V \text { and } j \in V\),

  • \(\mu : V \rightarrow L_V\) is a vertex labeling function which associates the label \(\mu (v)\) to all vertices v of V, where \(L_V\) is the set of possible labels for the vertices,

  • \(\xi : E \rightarrow L_E\) is an edge labeling function which associates the label \(\xi (e)\) to all edges e of E, where \(L_E\) is the set of possible labels for the edges.

The vertices (resp. edges) label space \(L_V\) (resp. \(L_E\)) may be composed of any combination of numeric, symbolic or string attributes.

Definition 2

The graph edit distance d(., .) is a function

$$\begin{aligned} d : \mathcal {G} \times \mathcal {G}\rightarrow & {} \mathbb {R}^+ \\ (G_1,G_2)\mapsto & {} d(G_1,G_2) = \min _{o=(o_1,\ldots ,o_k)\in \varGamma (G_1,G_2)} \sum _{i=1}^kc(o_i) \end{aligned}$$

where \(G_1=(V_1,E_1, \mu _1, \xi _1)\) and \(G_2=(V_2,E_2, \mu _2, \xi _2)\) are two graphs from the set \(\mathcal {G}\) and \(\varGamma (G_1,G_2)\) is the set of all edit paths \(o=(o_1,\ldots ,o_k)\) allowing to transform \(G_1\) into \(G_2\). An elementary edit operation \(o_i\) is one of vertex substitution (\(v_1 \rightarrow v_2\)), edge substitution (\(e_1 \rightarrow e_2\)), vertex deletion (\(v_1 \rightarrow \epsilon \)), edge deletion: (\(e_1 \rightarrow \epsilon \)), vertex insertion (\( \epsilon \rightarrow v_2\)) and edge insertion (\( \epsilon \rightarrow e_2\)) with \(v_1 \in V_1\), \(v_2 \in V_2\), \(e_1 \in E_1\) and \(e_2 \in E_2\). \(\epsilon \) is a dummy vertex or edge which is used to model insertion or deletion. c(.) is a function which associates a cost to each elementary edit operation \(o_i\).

In the more general case where GED is computed between attributed graphs, edit costs are generally defined as functions of vertices (resp. edges) attributes. More precisely, substitution costs are defined as a function of the attributes of the substituted vertices (resp. edges), whereas insertion and deletion are penalized with a value linked to the attributes of the inserted/deleted vertex (resp. edge):

$$\begin{aligned} \begin{array}{l} c(v_1 \rightarrow v_2) = c(v_2 \rightarrow v_1) = f_v(\mu _1(v_1),\mu _2(v_2)) \\ c(e_1 \rightarrow e_2) = c(e_2 \rightarrow e_1) = f_e(\xi _1(e_1),\xi _2(e_2)) \\ c(v \rightarrow \epsilon ) = c(\epsilon \rightarrow v) = g_v(\mu (v)) \\ c(e \rightarrow \epsilon ) = c(\epsilon \rightarrow e) = g_e(\xi (e)) \end{array} \end{aligned}$$

3 BLP Formulation of GED

Binary Linear Programing is a restriction of integer linear programming (ILP) with binary variables. Its general form is :

$$\begin{aligned} \min _{x}&~{c^T x} \end{aligned}$$
(1a)
$$\begin{aligned} \text {subject to }&Ax \le b \end{aligned}$$
(1b)
$$\begin{aligned}&x \in \{0,1\}^n \end{aligned}$$
(1c)

where \(c \in \mathbb {R}^{n}, A \in \mathbb {R}^{n \times m} \text { and } b \in \mathbb {R}^{m}\) are data of the problem. A feasible solution is a vector x of n binary variables (1c) which respects linear inequality constraints (1b). The constraint (1c) which enumerates the admissible values of variables is called a domain constraint. If the program has at least a feasible solution, then the optimal solutions are the ones that minimize the objective function (1a) which is a linear combination of variables of x weighted by the components of the vector c.

In this section, we successively define the variables, the objective function and the linear constraint of the BLP used for formulating GED as a BLP.

3.1 BLP-GED Variables

Our goal is to compute GED between two graphs \(G_{1} = (V_1,E_1,\mu _1,\xi _1) \) and \(G_{2} = (V_2,E_2,\mu _2,\xi _2)\). In the rest of this section, for the sake of simplicity of notations, we consider that the graphs \(G_{1}\) and \(G_{2}\) are simple directed graphs. However, the formulations given in this section can be applied directly without modification to multigraphs. Besides, the extension to the undirected case only needs some slight modifications.

In the GED definition provided in Sect. 2, the edit operations that are allowed to match the graphs \(G_{1}\) and \(G_{2}\) are (i) the substitution of a vertex (respectively an edge) of \(G_{1}\) with a vertex (resp. an edge) of \(G_{2}\), (ii) the deletion of a vertex (or an edge) from \(G_{1}\) and (iii) the insertion of a vertex (or an edge) of \(G_{2}\) in \(G_{1}\). For each type of edit operation, we define a set of corresponding binary variables:

  • \(\forall (i,k) \in V_1 \times V_2, x_{i,k} = \left\{ \begin{array}{l} 1 \text { if }\ i\ \text { is substituted with }\ k,\\ 0\ \text {otherwise.}\\ \end{array} \right. \)

  • \(\forall (ij,kl) \in E_1 \times E_2, y_{ij,kl} = \left\{ \begin{array}{l} 1\ \text {if}\ ij\ \text { is substituted with }\ kl,\\ 0\ \text {otherwise.}\\ \end{array} \right. \)

  • \(\forall i \in V_1, u_i = \left\{ \begin{array}{l} 1 \text { if }\ i\ \text { is deleted from } G_{1}\\ 0\ \text {otherwise.}\\ \end{array} \right. \)

  • \(\forall ij \in E_1, e_{ij} = \left\{ \begin{array}{l} 1 \text { if }\ ij\ \text { is deleted from } G_{1}\\ 0\ \text { otherwise.}\\ \end{array} \right. \)

  • \(\forall k \in V_2, v_k = \left\{ \begin{array}{l} 1 \text { if }\ k\ \text { is inserted in } G_{1}\\ 0 \text { otherwise.}\\ \end{array} \right. \)

  • \(\forall kl \in E_2, f_{kl} = \left\{ \begin{array}{l} 1 \text { if }\ kl\ \text { is inserted in } G_{1}\\ 0\ \text {otherwise.}\\ \end{array} \right. \)

Using these notations, we define an edit path between \(G_{1}\) and \(G_{2}\) as a 6-tuple \((\mathbf {x}, \mathbf {y}, \mathbf {u}, \mathbf {v}, \mathbf {e}, \mathbf {f})\) where \(\mathbf {x} = (x_{i,k})_{(i,k) \in V_1 \times V_2}\), \(\mathbf {y} = (y_{ij,kl})_{(ij,kl) \in E_1 \times E_2}\), \(\mathbf {u} = (u_i)_{i \in V_1}\), \(\mathbf {e} = (e_{ij})_{ij \in E_1}\), \( \mathbf {v} = (v_k)_{k \in V_2}\) and \(\mathbf {f} = (f_{kl})_{kl \in E_2}\).

In order to evaluate the global cost of an edit path, elementary costs for each edit operation must be defined. We adopt the following notations for these costs:

  • \(\forall (i,k) \in V_1 \times V_2, c(i \rightarrow k)\) is the cost of substituting the vertex i with k,

  • \(\forall (ij,kl) \in E_1 \times E_2, c(ij \rightarrow kl)\) is the cost of substituting the edge ij with kl,

  • \(\forall i \in V_1, c(i \rightarrow \epsilon )\) is the cost of deleting the vertex i from \(G_{1}\),

  • \(\forall ij \in E_1, c(ij \rightarrow \epsilon )\) is the cost of deleting the edge ij from \(G_{1}\),

  • \(\forall k \in V_2, c(\epsilon \rightarrow k)\) is the cost of inserting the vertex k in \(G_{1}\),

  • \(\forall kl \in E_2, c(\epsilon \rightarrow kl)\) is the cost of inserting the edge kl in \(G_{1}\).

3.2 Objective Function

The objective function is the overall cost induced by applying an edit path \((\mathbf {x}, \mathbf {y}, \mathbf {u}, \mathbf {v}, \mathbf {e}, \mathbf {f})\) that transforms a graph \(G_{1}\) into \(G_{2}\), using the elementary costs defined above. GED between \(G_{1}\) and \(G_{2}\) is the minimum value of the objective function (2) over \((\mathbf {x}, \mathbf {y}, \mathbf {u}, \mathbf {v}, \mathbf {e}, \mathbf {f})\) subject to constraints detailed in Sect. 3.3.

$$\begin{aligned} \begin{aligned} d(G_1,G_2) = \min _{\mathbf {x,y,u,v,e,f}} \Biggl (&\sum _{i \in V_1}\sum _{k \in V_2} c(i \rightarrow k) \cdot x_{i,k} + \sum _{ij \in E_1}\sum _{kl \in E_2} c(ij \rightarrow kl) \cdot y_{ij,kl} \\&+ \sum _{i \in V_1} c(i \rightarrow \epsilon ) \cdot u_i + \sum _{k \in V_2} c(\epsilon \rightarrow k) \cdot v_k \\&+ \sum _{ij \in E_1} c(ij \rightarrow \epsilon ) \cdot e_{ij} + \sum _{kl \in E_2} c(\epsilon \rightarrow kl) \cdot f_{kl} \Biggr ) \end{aligned} \end{aligned}$$
(2)

3.3 Constraints

The constraints presented in this part are designed to guarantee that the admissible solutions of the BLP are edit paths that transform \(G_{1}\) in a graph which is isomorphic to \(G_{2}\). An edit path is considered as admissible if and only if the following conditions are respected:

  1. 1.

    It provides a one-to-one mapping between a subset of the vertices of \(G_{1}\) and a subset of the vertices of \(G_{2}\). The remaining vertices are either deleted or inserted,

  2. 2.

    It provides a one-to-one mapping between a subset of the edges of \(G_{1}\) and a subset of the edges of \(G_{2}\). The remaining edges are either deleted or inserted,

  3. 3.

    The vertices matchings and the edges matchings are consistent, i.e. the graph topology is respected.

The following paragraphs describe the linear constraints used to integrate these conditions into the BLP.

1 - Vertices Matching Constraints. The constraint (3) ensures that each vertex of \(G_{1}\) is either matched to exactly one vertex of \(G_{2}\) or deleted from \(G_{1}\), while the constraint (4) ensures that each vertex of \(G_{2}\) is either matched to exactly one vertex of \(G_{1}\) or inserted in \(G_{1}\):

$$\begin{aligned} u_i + \sum _{k \in V_2} x_{i,k} = 1 \quad \forall i \in V_1 \end{aligned}$$
(3)
$$\begin{aligned} v_k + \sum _{i \in V_1} x_{i,k} = 1 \quad \forall k \in V_2 \end{aligned}$$
(4)

2 - Edges Matching Constraints. Similar to the vertex matching constraints, the constraints (5) and (6) guarantee a valid mapping between the edges:

$$\begin{aligned} e_{ij} + \sum _{kl \in E_2} y_{ij,kl} = 1 \quad \forall ij \in E_1 \end{aligned}$$
(5)
$$\begin{aligned} f_{kl} + \sum _{ij \in E_1} y_{ij,kl} = 1 \quad \forall kl \in E_2 \end{aligned}$$
(6)

3 - Topological Constraints. In order to respect the graph topology in the matching, an edge \(ij \in E_1\) can be matched to an edge \(kl \in E_2\) only if the head vertices \(i \in V_1\) and \(k \in V_2\), on the one hand, and if the tail vertices \(j \in V_1\) and \(l \in V_2\), on the other hand, are respectively matched. This quadratic constraint can be expressed linearly with the following constraints (7) and (8):

  • ij and kl can be matched if and only if their head vertices are matched:

    $$\begin{aligned} y_{ij,kl} \le x_{i,k} \quad \forall (ij, kl) \in E_1 \times E_2 \end{aligned}$$
    (7)
  • ij and kl can be matched if and only if their tail vertices are matched:

    $$\begin{aligned} y_{ij,kl} \le x_{j,l} \quad \forall (ij, kl) \in E_1 \times E_2 \end{aligned}$$
    (8)

Equations 2 to 8, coupled with domain constraints which ensure that the solution is binary leads to our BLP formulation :

$$\begin{aligned} \text {(BLP-GED)} \end{aligned}$$
$$\begin{aligned} \begin{aligned} \min _{\mathbf {x,y,u,v,e,f}} \Biggl (&\sum _{i \in V_1}\sum _{k \in V_2} c(i \rightarrow k) \cdot x_{i,k} \\&+ \sum _{ij \in E_1}\sum _{kl \in E_2} c(ij \rightarrow kl) \cdot y_{ij,kl} \\&+ \sum _{i \in V_1} c(i \rightarrow \epsilon ) \cdot u_i + \sum _{k \in V_2} c(\epsilon \rightarrow k) \cdot v_k \\&+ \sum _{ij \in E_1} c(ij \rightarrow \epsilon ) \cdot e_{ij} + \sum _{kl \in E_2} c(\epsilon \rightarrow kl) \cdot f_{kl} \Biggr ) \end{aligned} \end{aligned}$$
(9a)
$$\begin{aligned} \text {subject to}\quad&u_i + \sum _{k \in V_2} x_{i,k} = 1 \quad \forall i \in V_1\end{aligned}$$
(9b)
$$\begin{aligned}&v_k + \sum _{i \in V_1} x_{i,k} = 1 \quad \forall k \in V_2\end{aligned}$$
(9c)
$$\begin{aligned}&e_{ij} +\sum _{kl \in E_2} y_{ij,kl} = 1 \quad \forall ij \in E_1\end{aligned}$$
(9d)
$$\begin{aligned}&f_{kl}+\sum _{ij \in E_1} y_{ij,kl} = 1 \quad \forall kl \in E_2\end{aligned}$$
(9e)
$$\begin{aligned}&y_{ij,kl} \le x_{i,k} \quad \forall (ij, kl) \in E_1 \times E_2\end{aligned}$$
(9f)
$$\begin{aligned}&y_{ij,kl} \le x_{j,l} \quad \forall (ij, kl) \in E_1 \times E_2\end{aligned}$$
(9g)
$$\begin{aligned} \text {with}\quad&x_{i,k} \in \{0, 1\} \quad \forall (i, k) \in V_1 \times V_2\end{aligned}$$
(9h)
$$\begin{aligned}&y_{ij,kl} \in \{0, 1\} \quad \forall (ij, kl) \in E_1 \times E_2\end{aligned}$$
(9i)
$$\begin{aligned}&u_i \in \{0, 1\} \quad \forall i \in V_1\end{aligned}$$
(9j)
$$\begin{aligned}&v_k \in \{0, 1\} \quad \forall k \in V_2\end{aligned}$$
(9k)
$$\begin{aligned}&e_{ij} \in \{0, 1\} \quad \forall ij \in E_1\end{aligned}$$
(9l)
$$\begin{aligned}&f_{kl} \in \{0, 1\} \quad \forall kl \in E_2 \end{aligned}$$
(9m)

4 Experiments

In this section, we present some experimental results obtained using the proposed formulation, compared with those of two reference methods. The first one is based on the A\(^*\) algorithm with a bipartite heuristic [21]. This method is the most well-known exact GED method and is often used to evaluate the accuracy of approximate methods. The second exact method is the BLP proposed by Justice and Hero in [9]. This method, called JH in the paper, is directly linked to our proposal. Since this method cannot deal with edge attributes, we could not perform JH on all our datasets.

In this practical work, our method has been implemented in \(C\#\) using CPLEX Concert Technology. All the methods were run on a 2.6 GHz quad-core computer with 8 GB RAM. For the sake of comparison, none of the methods were parallelized and CPLEX was set up in a deterministic manner.

The 3 methods are evaluated on 7 reference datasets:

  • GREC [17] is composed of undirected graphs of rather small size (i.e. up to 20 vertices in our experiments). In addition, continuous attributes on vertices and edges play an important role in the matching procedure. Such graphs are representative of pattern recognition problems where graphs are involved in a classification stage.

  • PROTEIN [17] is a molecule dataset. In similar datasets, vertices are labeled with atomic chemical elements, which imposes a stringent constraint of label equality in the matching process. In this particular dataset, vertices are labeled with chemical element sequences. The stringent constraint can be relaxed thanks to the string edit distance. So the matching process can be tolerant and accommodate with differences on labels.

  • ILPISO [8] stands apart from the others in the sense that this dataset hold directed graphs. The aim is to illustrate the flexibility of our proposal that can handle different types of graphs.

  • LETTER [17] is broken down into three parts (LOW, MED, HIGH) which corresponds to distortion levels. The LETTER dataset is useful because it holds graphs of rather small size (maximum of 9 vertices), what makes feasible the computations of all GED methods.

  • PAH Footnote 1 is a purely structural database with no labels at all. PAH is a hard dataset gathering graphs of rather large size (\(\ge 20 \quad vertices\)). Graphs are complex and hard to match in cases where neighborhoods and attributes do not allow easily to differentiate between vertices.

All these datasets are publicly available on IAPR TC15 websiteFootnote 2. In order to evaluate the algorithms behaviours when the size of the problem grows, we have built subsets where all graphs have the same number of vertices for GREC, PROTEIN, ILPISO datasets. Concerning edit costs, we borrow the setting from [19] for GREC, PROT, and LETTER databases.

In order to evaluate an exact GED computation method, the natural criterion is the computation time. However, from a practical point of view, exact GED solving can be infeasible in a reasonable time or can overstep memory capacity. In our framework, if a time limit of 300 s is reached, the instance is considered as unsolved. The same situation arises when the memory limit of 1 Gb is reached.

Obtained results when comparing the 3 methods are presented in Table 1. Two values are given to qualify each method on each dataset: the percentage of instances solved to optimality (without time or memory problems), called “Opt” and the average running time in milliseconds called “Time”. In the objective of a fair comparison, the average running time is calculated only from instances solved to optimality by all the methods. If the memory limit is reached, “OM” appears in the table. The “NA” values mean that the algorithm can not be applied to the dataset. It occurs for the JH method on GREC, PROT and ILPISO dataset since JH formulation does not take into account edge labels.

Table 1. Results

Some observations can be made from these results. First, as expected, A* is the worst method. It reaches the optimal solution only for datasets composed of very small graphs (LETTER and GREC10). A* is the fastest method for small graphs. When graphs are larger than 10 vertices, all the instances are not optimally solved. When graphs hold 10 vertices, the execution time increase considerably by a factor 2000 compared to graphs of 5 vertices. The combinatorial explosion is not prevented by the pruning strategy. Beyond 10 vertices, A* cannot converge to the optimality because of memory saturation phenomenon. The size of the list OPEN containing pending solutions grows exponentially according to the graph size and the bipartite heuristic fails to prune the search tree efficiently.

Second, when analysing BLP-GED results, one can see that for datasets composed of smaller graphs (LETTER, ILPISO10, GREC5-10), BLP-GED converges to the optimality. For larger graphs (GREC20, PROT and ILPISO25-50), the optimal solution is not always reached, because of the time restriction but BLP-GED is the only method to solve instances and to output solutions. An interesting comment comes from the GREC dataset where nearly all instances are solved to optimality however increasing the graph size of only 5 vertices can lead to an important increase of time. The relation between graph size and solving time is not linear at all and it recalls us humbly how hard is the GED problem.

Finally, concerning JH, the method performs well for the PAH dataset solving 100 % of instances optimally against 51 % for BLP-GED. JH achieved better results because it is not generic but dedicated to compare unweighted graphs. However, on a smaller graphs like in LETTER, JH is the slowest method due to a higher number of variables and constraints than BLP-GED formulation. Finally, JH is not flexible and the method is unable to consider edge labels and cannot be applied to GREC, PROT and ILPISO.

5 Conclusion

In this paper, an exact binary linear programming formulation of the GED problem has been presented. One of the major advantage of our proposal against another binary program (JH) is that it can deal with a wide range of attributed relational graphs: directed or undirected graphs, simple graphs or multigraphs, with a combination of symbolic, numeric and/or string attributes on vertices and edges. Moreover, obtained results show that our BLP is about 100 times faster than \(A*\) for medium-size graphs while being more memory efficient. Our future works concern the optimization and the approximation of our formulation of GED computation.