On the exact computation of the graph edit distance

https://doi.org/10.1016/j.patrec.2018.05.002

Highlights

  • We consolidate the state of the art on the exact computation of GED by presenting all existing algorithms in a unified way.

  • We harmonise the definitions of GED employed in different research communities.

  • We provide a speed-up for uniform edit costs and a generalisation to non-uniform edit costs of two state-of-the-art algorithms.

  • We suggest the smallest currently available IP-formulation of the problem of computing GED.

  • We provide the first empirical evaluation that compares all available algorithms for the exact computation of GED.

Abstract

The graph edit distance is a widely used distance measure for labelled graphs. However, AGED, the standard approach for its exact computation, suffers from huge runtime and memory requirements. Recently, three better performing algorithms have been proposed: the general algorithms DFGED and BIPGED, and the algorithm CSIGED, which only works for uniform edit costs. All newly proposed algorithms outperform the standard approach AGED. However, cross-comparisons are lacking. This paper consolidates and extends these recent advances. To this end, we present all existing algorithms in a unified way and show that the slightly different definitions of the graph edit distance underlying AGED and DFGED, on the one side, and CSIGED, on the other side, can be harmonised. This harmonisation allows us to develop a generalisation of CSIGED to non-uniform edit costs. Moreover, we present a speed-up of AGED and DFGED for uniform edit costs, which builds upon the fact that, in the uniform case, a repeatedly used subroutine can be implemented to run in linear rather than cubic time. We also suggest an algorithm MIPGED, which builds upon a very compact new mixed integer linear programming formulation. Finally, we carry out a thorough empirical evaluation, which, for the first time, compares all existing exact algorithms.

Introduction

Labelled graphs can be used for modelling various kinds of objects, such as images, molecular structures, and many more. Because of this, labelled graphs have received increasing attention over the past years. One task researchers have focused on is the following: Given a database $\mathcal{G}$ that contains labelled graphs, find all graphs $G \in \mathcal{G}$ that are sufficiently similar to a query graph H or find the k graphs from $\mathcal{G}$ that are most similar to H [1], [2], [3]. Being able to quickly answer queries of this kind is crucial for the development of performant pattern recognition techniques in various application domains [4] such as keyword spotting in handwritten documents [5] and cancer detection [6].

For answering graph similarity queries, a distance measure between two labelled graphs G and H has to be defined. A very flexible, sensitive and therefore widely used measure is the graph edit distance (GED), which is defined as the minimum cost of an edit path between G and H. An edit path is a sequence of graphs starting at G and ending at a graph that is isomorphic to H such that each graph on the path can be obtained from its predecessor by applying one of the following edit operations: adding or deleting an isolated node or an edge, and relabelling an existing node or edge. Each edit operation comes with an associated edit cost. The cost of an edit path is defined as the sum of the costs of its edit operations. If the costs of all edit operations equal 1, we say that the edit costs are uniform. In many scenarios, it is natural to consider non-uniform edit costs [7]. For instance, if the graphs model spatial objects and the node labels are Euclidean coordinates, the node relabelling cost should probably be defined as the Euclidean distance between the node labels.
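To make the cost of an edit path concrete, the following minimal sketch sums the costs of a hand-written sequence of edit operations. The tuple encoding of the operations, the function names `edit_cost` and `path_cost`, and the constant cost of 1 for every operation other than node relabelling are assumptions made purely for illustration. Only the Euclidean node relabelling cost follows the example given above.

```python
import math


def edit_cost(op):
    """Cost of a single edit operation under an illustrative cost model."""
    kind = op[0]
    if kind == "relabel_node":
        # Node labels are Euclidean coordinates, so relabelling costs
        # the Euclidean distance between the old and the new label.
        _, old_label, new_label = op
        return math.dist(old_label, new_label)
    # Assumption: every other edit operation has cost 1.
    return 1.0


def path_cost(path):
    """The cost of an edit path is the sum of the costs of its operations."""
    return sum(edit_cost(op) for op in path)


# Hypothetical edit path: move one node, delete one edge, add one isolated node.
path = [
    ("relabel_node", (0.0, 0.0), (3.0, 4.0)),
    ("delete_edge", ("u", "v")),
    ("add_node", (1.0, 1.0)),
]
total = path_cost(path)
```

Because the relabelling cost depends on the labels involved, this cost model is non-uniform. GED is the minimum of `path_cost` over all edit paths that transform G into a graph isomorphic to H.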

It has been shown that, even for uniform edit costs, it is NP-hard to exactly compute GED [8]. However, efficient exact algorithms are still important, mainly because many objects that are readily modelled by labelled graphs (for instance, some molecular compounds) induce relatively small graphs [9]. For these graphs, GED-based similarity queries can in principle be answered. Of course, one would first use efficiently computable upper and lower bounds in order to filter out data graphs that are very far away from the query graph. However, for the surviving candidates, the exact graph edit distance still has to be computed.

The standard approach AGED [10] for exactly computing GED carries out a node-based best-first search in order to find the optimal edit path. It is very slow and has huge memory requirements. Recently, better performing algorithms have been proposed: Abu-Aisheh et al. [11] developed DFGED, an algorithm which carries out a node-based depth-first search for finding the cheapest edit path. DFGED has been found to be much more memory-efficient and slightly faster than AGED. Gouda and Hassaan [12] developed the algorithm CSIGED, which carries out an edge-based depth-first search. It has also been found to be both faster and much more memory-efficient than AGED. Finally, Lerouge et al. [13] developed the algorithm BIPGED, which computes GED by passing an integer programming (IP) formulation of the problem to a commercial IP solver. The IP formulation employed by BIPGED has $\Theta(|E^G| \cdot |E^H|)$ variables and $\Theta(|V^G| \cdot |E^H|)$ constraints. BIPGED, too, has been found to be faster and more memory-efficient than AGED. While AGED, BIPGED, and DFGED cover non-uniform edit costs, CSIGED only works for uniform edit costs. A direct comparison between BIPGED, DFGED, and CSIGED is lacking. This might partly be due to the fact that CSIGED was published in a database venue, while DFGED and BIPGED were published in pattern recognition outlets, and that the database and the pattern recognition communities use slightly different definitions of the graph edit distance.

This paper consolidates and extends these recent advances. More specifically, it contains the following contributions:

  1. All existing algorithms for the exact computation of GED are presented in a unified way. This enables a fair comparison and allows future research to combine techniques employed by the existing approaches in order to come up with ideas for more efficient new algorithms.

  2. The slightly different definitions of GED employed in the database and in the pattern recognition communities are harmonised.

  3. A speed-up of AGED and DFGED for uniform edit costs is provided. The speed-up exploits the fact that, in the uniform case, a subroutine that AGED and DFGED employ at each node of their search trees can be implemented to run in linear rather than cubic time.

  4. By using the harmonisation of the definitions of GED, CSIGED is generalised to non-uniform edit costs. This comes at the price of a slightly increased runtime. However, the increase is very moderate, as the computational complexity is increased only at the leaves of CSIGED’s search tree.

  5. A new IP-based algorithm MIPGED is presented. MIPGED is mainly interesting from a theoretical viewpoint, as its IP has only $\Theta(|V^G| \cdot |V^H|)$ variables and constraints and is hence much smaller than the one employed by BIPGED.

  6. All existing algorithms for the computation of GED are compared in a thorough empirical evaluation. The main results of the experiments are that, for small graphs, our generalisation of CSIGED is the best algorithm for non-uniform edit costs, while for uniform edit costs, our speed-up of DFGED is the best algorithm. For larger graphs, BIPGED is the best performing currently available algorithm.
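The linear-versus-cubic claim in contribution 3 can be illustrated with a small sketch. This excerpt does not spell out the subroutine, so the following is an assumption: we take it to be a minimum-cost assignment between two multisets of labels, where substituting two different labels, deleting a label, and inserting a label each cost 1. A general assignment solver needs cubic time for this, but under uniform costs the optimum equals max(|A|, |B|) minus the size of the multiset intersection of A and B, which hash-based counting yields in linear time. The function name `uniform_assignment_cost` is hypothetical.

```python
from collections import Counter


def uniform_assignment_cost(labels_a, labels_b):
    """Optimal cost of assigning two label multisets under uniform costs.

    Equal labels are matched at cost 0. Each remaining label is either
    substituted (cost 1) or, if one side is larger, deleted or inserted
    (cost 1). Runs in time linear in the number of labels.
    """
    # Multiset intersection: per label, the minimum of the two counts.
    common = sum((Counter(labels_a) & Counter(labels_b)).values())
    return max(len(labels_a), len(labels_b)) - common


# Labels C, C, O against C, N: one C is matched for free, the second C is
# substituted by N, and O is deleted.
cost = uniform_assignment_cost(["C", "C", "O"], ["C", "N"])
```

The shortcut relies on all costs being equal. With non-uniform costs the pairing of unequal labels matters, and a general assignment algorithm is needed again.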

The remainder of the paper is organised as follows: In Section 2, we introduce key concepts and notations that are used throughout the paper. In Section 3, we harmonise the definitions of the graph edit distance employed in the database and in the pattern recognition communities. In Section 4, we present the node-based algorithms AGED and DFGED and show how they can be sped up for uniform edit costs. In Section 5, we present the edge-based algorithm CSIGED and show how to generalise it to non-uniform edit costs. In Section 6, we explain how GED can be computed in an IP-based way, present the IP employed by BIPGED, and develop the IP used by MIPGED. In Section 7, we experimentally evaluate the algorithms. Section 8 concludes the paper and points to possible future work. The paper extends the results published in [14], where the speed-up of DFGED and the extension of CSIGED to non-uniform metric edit costs were presented.

Preliminaries

In this paper, we consider undirected, labelled graphs. However, all results can straightforwardly be extended to directed graphs. Formally, an undirected, labelled graph G is a 4-tuple $G = (V^G, E^G, \ell_V^G, \ell_E^G)$, where $V^G$ is a set of nodes, $E^G \subseteq \binom{V^G}{2}$ is a set of undirected edges, and $\ell_V^G : V^G \to \Sigma_V$ and $\ell_E^G : E^G \to \Sigma_E$ are labelling functions that assign nodes and edges to labels from alphabets $\Sigma_V$ and $\Sigma_E$. Both $\Sigma_V$ and $\Sigma_E$ contain a special label ε for dummy nodes and edges. The following definition of GED was

Harmonising the definitions of GED

Definition 1 assumes that the edit operations and the edit costs are defined over the labels. In the pattern recognition community, a slightly different definition due to Bunke and Allermann [20] is used, which assumes the edit operations and costs to be defined directly over the nodes and edges. Bougleux et al. [21] showed that, with this definition, GED can equivalently be defined as $\mathrm{GED}(G,H) = \min\{g(\pi) \mid \pi \in \bar{\Pi}(G,H)\}$, where $\bar{\Pi}(G,H)$ is the set of all complete node maps between G and H introduced in

Node-based computation of GED

In this section, we discuss the standard paradigm for the computation of GED, which consists of enumerating the space of all node maps between G and H. In Section 4.1, we present the existing algorithms AGED and DFGED, which employ this paradigm. In Section 4.2, we show how they can be sped up for uniform edit costs.
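The node-map paradigm can be sketched as a naive enumeration. The code below is neither AGED nor DFGED: it has no lower bounds, pruning, or search-order heuristics, and only shows what "enumerating the space of all node maps" means for tiny graphs. It assumes uniform edit costs, graphs without self-loops, and a hypothetical encoding of a graph as a pair `(nodes, edges)`, where `nodes` maps node ids to labels and `edges` maps `frozenset({u, v})` to labels. All function names are illustrative.

```python
def node_maps(g_nodes, h_nodes):
    """Yield every map sending each node of G to a distinct node of H or to None.

    None stands for deletion. Nodes of H that are not hit count as insertions.
    """
    if not g_nodes:
        yield {}
        return
    first, rest = g_nodes[0], g_nodes[1:]
    for target in [None] + h_nodes:
        remaining = [k for k in h_nodes if k != target]
        for tail in node_maps(rest, remaining):
            yield {first: target, **tail}


def induced_cost(G, H, pi):
    """Edit cost induced by node map pi when every edit operation costs 1."""
    (VG, EG), (VH, EH) = G, H
    cost = 0
    for u, k in pi.items():
        if k is None or VG[u] != VH[k]:
            cost += 1  # node deletion or node relabelling
    cost += len(set(VH) - set(pi.values()))  # node insertions
    matched = 0
    for edge, label in EG.items():
        u, v = tuple(edge)
        image = frozenset((pi[u], pi[v]))
        if image in EH:
            matched += 1
            cost += label != EH[image]  # edge relabelling if labels differ
        else:
            cost += 1  # edge deletion
    cost += len(EH) - matched  # edge insertions
    return cost


def ged_brute_force(G, H):
    """Minimum induced cost over all node maps (exponential time)."""
    return min(induced_cost(G, H, pi) for pi in node_maps(list(G[0]), list(H[0])))


# Hypothetical toy graphs: a path C-C-O and a single edge C-N.
G = ({1: "C", 2: "C", 3: "O"},
     {frozenset({1, 2}): "s", frozenset({2, 3}): "s"})
H = ({"x": "C", "y": "N"},
     {frozenset({"x", "y"}): "s"})
distance = ged_brute_force(G, H)
```

The number of node maps grows exponentially with the number of nodes, which is why practical algorithms explore this space with best-first or depth-first search and prune branches with lower bounds.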

Edge-based computation of GED

In this section, we discuss how to compute GED in an edge-based way. In Section 5.1, we present the existing algorithm CSIGED, which introduced the edge-based paradigm for uniform edit costs. In Section 5.2, we show that, with the help of the harmonisation of the definitions of GED offered in Section 3, CSIGED and the edge-based paradigm it employs can be generalised to non-uniform edit costs.

IP-based computation of GED

GED can also be computed in an IP-based way. The backbone of this paradigm is the observation that the alternative GED Definition 1 straightforwardly translates into the quadratic programming formulation (2) of the problem of computing GED: The binary variable $x_{i,k}$ indicates whether or not a complete node map π maps i to k. The first group of constraints ensures that supp(π) covers $V^G$ and that π is functional on $V^G$, while the second group ensures that img(π) covers $V^H$ and that π is injective on $V^H$.
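Formulation (2) itself is not reproduced in this excerpt. As a hedged sketch of what the two groups of constraints described above might look like, assume additional binary variables $x_{i,\epsilon}$ and $x_{\epsilon,k}$ that indicate the deletion of node i and the insertion of node k. These extra variables and the exact form of the constraints are assumptions and may differ from the paper's formulation.

```latex
\begin{align*}
  \sum_{k \in V^H} x_{i,k} + x_{i,\epsilon} &= 1
    && \forall i \in V^G
    && \text{(each node of $G$ is mapped exactly once)} \\
  \sum_{i \in V^G} x_{i,k} + x_{\epsilon,k} &= 1
    && \forall k \in V^H
    && \text{(each node of $H$ is hit exactly once)} \\
  x_{i,k},\; x_{i,\epsilon},\; x_{\epsilon,k} &\in \{0, 1\}
    && \forall i \in V^G,\ k \in V^H
\end{align*}
```

The objective becomes quadratic because the cost of an edge operation depends on how both endpoints are mapped, that is, on products of two such variables.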

Setup and datasets

In the experiments, we compared the performance of AGED, DFGED, CSIGED, BIPGED, and MIPGED for both uniform and non-uniform edit costs. All algorithms were implemented in C++ and employ the same data structures and subroutines. For the IP-based algorithms BIPGED and MIPGED, we used Gurobi Optimization. All tests were carried out on a machine with two Intel Xeon E5-2667 v3 processors with 8 cores each and 98 GB of main memory running GNU/Linux.

We conducted tests on the datasets Protein,

Conclusions and future work

In this paper, we harmonised the definitions of GED used in the pattern recognition and in the database communities, suggested new methods for its exact computation, and carried out extensive experiments for comparing available exact GED algorithms. One negative takeaway message of these experiments is that no currently available algorithm manages to reliably compute GED within reasonable time between graphs with more than 16 nodes. However, there is at least one good reason to believe that

References (28)

  • X. Gao et al. A survey of graph edit distance. Pattern Anal. Applic. (2010)
  • Z. Zeng et al. Comparing stars: on approximating graph edit distance. PVLDB (2009)
  • K. Riesen et al. IAM graph database repository for graph based pattern recognition and machine learning. S+SSPR (2008)
  • K. Riesen et al. Speeding up graph edit distance computation with a bipartite heuristic. MLG (2007)