On the exact computation of the graph edit distance
Introduction
Labelled graphs can be used for modelling various kinds of objects, such as images, molecular structures, and many more. Because of this, labelled graphs have received increasing attention over the past years. One task researchers have focused on is the following: Given a database that contains labelled graphs, find all graphs that are sufficiently similar to a query graph H, or find the k graphs from the database that are most similar to H [1], [2], [3]. Being able to quickly answer queries of this kind is crucial for the development of performant pattern recognition techniques in various application domains [4], such as keyword spotting in handwritten documents [5] and cancer detection [6].
For answering graph similarity queries, a distance measure between two labelled graphs G and H has to be defined. A very flexible, sensitive and therefore widely used measure is the graph edit distance (GED), which is defined as the minimum cost of an edit path between G and H. An edit path is a sequence of graphs starting at G and ending at a graph that is isomorphic to H such that each graph on the path can be obtained from its predecessor by applying one of the following edit operations: adding or deleting an isolated node or an edge, and relabelling an existing node or edge. Each edit operation comes with an associated edit cost. The cost of an edit path is defined as the sum of the costs of its edit operations. If the costs of all edit operations equal 1, we say that the edit costs are uniform. In many scenarios, it is natural to consider non-uniform edit costs [7]. For instance, if the graphs model spatial objects and the node labels are Euclidean coordinates, the node relabelling cost should probably be defined as the Euclidean distance between the node labels.
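To make the definition concrete, the following is a minimal sketch that computes GED for very small graphs under uniform edit costs. It does not enumerate edit paths directly. Instead, it assumes the node-map view of GED (each node of G is mapped to a distinct node of H or deleted, and unmatched nodes of H are inserted), which coincides with the edit-path definition for uniform costs. The graph encoding (a pair of dictionaries) and the names `induced_cost`, `node_maps` and `brute_force_ged` are hypothetical and chosen for illustration only.

```python
def induced_cost(g, h, pi):
    """Uniform-cost price of the edit path induced by node map `pi`.

    A graph is a pair (node_labels, edge_labels); edges are frozensets of
    two distinct nodes. pi[u] is a node of H, or None if u is deleted.
    """
    g_nodes, g_edges = g
    h_nodes, h_edges = h
    cost = 0
    # Node deletions and node relabellings.
    for u, label in g_nodes.items():
        if pi[u] is None or h_nodes[pi[u]] != label:
            cost += 1
    # Node insertions: nodes of H that are not the image of any node of G.
    cost += len(set(h_nodes) - {v for v in pi.values() if v is not None})
    # Edge deletions and edge relabellings.
    covered = set()
    for edge, label in g_edges.items():
        u1, u2 = tuple(edge)
        image = frozenset((pi[u1], pi[u2]))
        if None not in image and image in h_edges:
            covered.add(image)
            cost += int(h_edges[image] != label)
        else:
            cost += 1
    # Edge insertions: edges of H that are not the image of any edge of G.
    cost += len(set(h_edges) - covered)
    return cost


def node_maps(g_nodes, h_nodes):
    """Yield every map from the nodes of G to distinct nodes of H or None."""
    if not g_nodes:
        yield {}
        return
    first, rest = g_nodes[0], g_nodes[1:]
    for partial in node_maps(rest, h_nodes):  # delete `first`
        yield {first: None, **partial}
    for i, target in enumerate(h_nodes):  # map `first` to `target`
        for partial in node_maps(rest, h_nodes[:i] + h_nodes[i + 1:]):
            yield {first: target, **partial}


def brute_force_ged(g, h):
    """Exhaustive search over all node maps; only feasible for tiny graphs."""
    return min(induced_cost(g, h, pi) for pi in node_maps(list(g[0]), list(h[0])))


# Toy example: a labelled path on three nodes versus a single labelled edge.
G = ({0: "C", 1: "C", 2: "O"}, {frozenset((0, 1)): "s", frozenset((1, 2)): "s"})
H = ({"x": "C", "y": "N"}, {frozenset(("x", "y")): "s"})
print(brute_force_ged(G, H))
```

The number of node maps grows super-exponentially with the graph sizes, so this sketch is only meant to illustrate the definition, not to serve as a practical algorithm.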
It has been shown that, even for uniform edit costs, it is NP-hard to exactly compute GED [8]. However, efficient exact algorithms are still important, mainly because many objects that are readily modelled by labelled graphs (for instance, some molecular compounds) induce relatively small graphs [9]. For these graphs, GED-based similarity queries can in principle be answered. Of course, one would first use efficiently computable upper and lower bounds in order to filter out data graphs that are very far away from the query graph. However, for the surviving candidates, the exact graph edit distance still has to be computed.
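The filter-and-verify scheme described above can be sketched as follows for a threshold query. This is an illustrative outline only: `range_query` and its parameters are hypothetical names, and the bound functions and the exact solver are passed in as callables rather than taken from any particular library.

```python
def range_query(query, database, tau, lower_bound, upper_bound, exact_ged):
    """Return all graphs in `database` whose GED to `query` is at most `tau`.

    `lower_bound` and `upper_bound` are assumed to be cheap functions with
    lower_bound(q, g) <= GED(q, g) <= upper_bound(q, g); `exact_ged` is the
    expensive exact computation.
    """
    answers = []
    for graph in database:
        if lower_bound(query, graph) > tau:
            continue  # filtered out: the graph is certainly too far away
        if upper_bound(query, graph) <= tau:
            answers.append(graph)  # accepted without exact computation
            continue
        # Surviving candidate: only here is the exact GED needed.
        if exact_ged(query, graph) <= tau:
            answers.append(graph)
    return answers
```

The tighter the bounds, the fewer candidates reach the last branch, which is where the runtime of the exact algorithm dominates.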
The standard approach [10] for exactly computing GED carries out a node-based best-first search in order to find the optimal edit path. It is very slow and has huge memory requirements. Recently, better performing algorithms have been proposed. Abu-Aisheh et al. [11] developed an algorithm which carries out a node-based depth-first search for finding the cheapest edit path. It has been found to be much more memory-efficient and slightly faster than the standard approach. Gouda and Hassaan [12] developed an algorithm which carries out an edge-based depth-first search. It also has been found to be both faster and much more memory-efficient than the standard approach. Finally, Lerouge et al. [13] developed an algorithm which computes GED by passing an integer programming (IP) formulation of the problem to a commercial IP solver. Their IP formulation has Θ(|EG| · |EH|) variables and Θ(|VG| · |EH|) constraints. It, too, has been found to be faster and more memory-efficient than the standard approach. While the algorithms of [11] and [13] cover non-uniform edit costs, the algorithm of [12] only works for uniform edit costs. A direct comparison between the algorithm of [12] and the other two is lacking. This might partly be due to the fact that [12] was published in a database venue, while [11] and [13] were published in pattern recognition outlets, and that the database and the pattern recognition communities use slightly different definitions of the graph edit distance.
This paper consolidates and extends these recent advances. More specifically, it contains the following contributions:
1. All existing algorithms for the exact computation of GED are presented in a unified way. This enables a fair comparison and allows future research to combine techniques employed by the existing approaches in order to come up with ideas for more efficient new algorithms.
2. The slightly different definitions of GED employed in the database and in the pattern recognition communities are harmonised.
3. A speed-up of the two existing node-based algorithms for uniform edit costs is provided. The speed-up exploits the fact that, in the uniform case, a subroutine that both algorithms employ at each node of their search trees can be implemented to run in linear rather than cubic time.
4. By using the harmonisation of the definitions of GED, the edge-based algorithm is generalised to non-uniform edit costs. This comes at the price of a slightly increased runtime. However, the increase is very moderate, as the computational complexity is increased only at the leaves of the algorithm's search tree.
5. A new IP-based algorithm is presented. It is mainly interesting from a theoretical viewpoint, as its IP has only Θ(|VG| · |VH|) variables and constraints and is hence much smaller than the one employed by the algorithm of Lerouge et al. [13].
6. All existing algorithms for the computation of GED are compared in a thorough empirical evaluation. The main results of the experiments are that, for small graphs, our generalisation of the edge-based algorithm is the best algorithm for non-uniform edit costs, while for uniform edit costs, our speed-up of a node-based algorithm is the best algorithm. For larger graphs, is the best performing currently available algorithm.
The remainder of the paper is organised as follows: In Section 2, we introduce key concepts and notations that are used throughout the paper. In Section 3, we harmonise the definitions of the graph edit distance employed in the database and in the pattern recognition communities. In Section 4, we present the two existing node-based algorithms and show how they can be sped up for uniform edit costs. In Section 5, we present the existing edge-based algorithm and show how to generalise it to non-uniform edit costs. In Section 6, we explain how GED can be computed in an IP-based way, present the IP employed by the existing IP-based algorithm, and develop the IP used by our new one. In Section 7, we experimentally evaluate the algorithms. Section 8 concludes the paper and points out possible future work. The paper extends the results published in [14], where the speed-up for uniform edit costs and the extension of the edge-based algorithm to non-uniform metric edit costs were presented.
Preliminaries
In this paper, we consider undirected, labelled graphs. However, all results can straightforwardly be extended to directed graphs. Formally, an undirected, labelled graph G is a 4-tuple consisting of a set of nodes VG, a set of undirected edges EG, and two labelling functions that assign the nodes and the edges labels from alphabets ΣV and ΣE, respectively. Both ΣV and ΣE contain a special label ε for dummy nodes and edges. The following definition of GED was
Harmonising the definitions of GED
Definition 1 assumes that the edit operations and the edit costs are defined over the labels. In the pattern recognition community, a slightly different definition due to Bunke and Allermann [20] is used, which assumes the edit operations and costs to be defined directly over the nodes and edges. Bougleux et al. [21] showed that, with this definition, GED can equivalently be defined as the minimum cost of an edit path induced by a complete node map, where the minimum is taken over the set of all complete node maps between G and H introduced in
Node-based computation of GED
In this section, we discuss the standard paradigm for the computation of GED, which consists in the enumeration of the space of all node maps between G and H. In Section 4.1, we present the existing algorithms which employ this paradigm. In Section 4.2, we show how they can be sped up for uniform edit costs.
Edge-based computation of GED
In this section, we discuss how to compute GED in an edge-based way. In Section 5.1, we present the existing algorithm which introduced the edge-based paradigm for uniform edit costs. In Section 5.2, we show that, with the help of the harmonisation of the definitions of GED offered in Section 3, this algorithm and the edge-based paradigm it employs can be generalised to non-uniform edit costs.
IP-based computation of GED
GED can also be computed in an IP-based way. The backbone of this paradigm is the observation that the alternative GED Definition 1 straightforwardly translates into the quadratic programming formulation (2) of the problem of computing GED. The binary variable x_{i,k} indicates whether or not a complete node map π maps i to k. The first group of constraints ensures that π is functional on VG, while the second group ensures that π is injective on VH.
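As an illustration of the two groups of constraints, one common way to write them is given below. The notation is an assumption made for this sketch, not a reproduction of formulation (2): V^G and V^H denote the node sets, and the dummy node ε stands for deletion (x_{i,ε} = 1) and insertion (x_{ε,k} = 1). The quadratic objective is omitted.

```latex
\begin{align*}
x_{i,\epsilon} + \sum_{k \in V^H} x_{i,k} &= 1 && \forall\, i \in V^G \\
x_{\epsilon,k} + \sum_{i \in V^G} x_{i,k} &= 1 && \forall\, k \in V^H \\
x_{i,k} &\in \{0,1\} && \text{for all index pairs used above}
\end{align*}
```

The first line forces every node of G to be either deleted or mapped to exactly one node of H. The second line forces every node of H to be either inserted or the image of exactly one node of G.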
Setup and datasets
In the experiments, we compared the performance of the algorithms for both uniform and non-uniform edit costs. All algorithms were implemented in C++ and employ the same data structures and subroutines. For the IP-based algorithms, we used Gurobi Optimization. All tests were carried out on a machine with two Intel Xeon E5-2667 v3 processors with 8 cores each and 98 GB of main memory running GNU/Linux.
We conducted tests on the datasets Protein,
Conclusions and future work
In this paper, we harmonised the definitions of GED used in the pattern recognition and in the database communities, suggested new methods for its exact computation, and carried out extensive experiments for comparing available exact GED algorithms. One negative takeaway message of these experiments is that no currently available algorithm manages to reliably compute GED within reasonable time between graphs with more than 16 nodes. However, there is at least one good reason to believe that
References
A long trip in the charming world of graphs for pattern recognition. Pattern Recognit. (2015)
New binary linear programming formulation to compute the graph edit distance. Pattern Recognit. (2017)
Inexact graph matching for structural pattern recognition. Pattern Recognit. Lett. (1983)
Graph edit distance as a quadratic assignment problem. Pattern Recognit. Lett. (2017)
Approximate graph edit distance computation by means of bipartite graph matching. Image Vis. Comput. (2009)
Thirty years of graph matching in pattern recognition. IJPRAI (2004)
Graph matching and learning in pattern recognition in the last 10 years. IJPRAI (2014)
A survey on applications of bipartite graph edit distance. GbRPR (2017)
A novel graph database for handwritten word images. S+SSPR (2016)
A hybrid classification model for digital pathology using structural and statistical pattern recognition. IEEE Trans. Med. Imaging (2013)
A survey of graph edit distance. Pattern Anal. Applic.
Comparing stars: on approximating graph edit distance. PVLDB
IAM graph database repository for graph based pattern recognition and machine learning. S+SSPR
Speeding up graph edit distance computation with a bipartite heuristic. MLG