Elsevier

Information Systems

Volume 56, March 2016, Pages 157-173

Tree edit distance: Robust and memory-efficient

https://doi.org/10.1016/j.is.2015.08.004

Highlights

  • We address the memory problem of the strategy computation in the RTED algorithm for the tree edit distance.

  • We prove an upper bound which guarantees that the strategy computation never uses more memory than the distance computation.

  • We compute the optimal strategy in the class of all-path strategies which subsumes the LRH strategies used before.

  • We develop new single-path functions which are better in terms of runtime and memory than the previously used functions.

Abstract

Hierarchical data are often modelled as trees. An interesting query identifies pairs of similar trees. The standard approach to tree similarity is the tree edit distance, which has successfully been applied in a wide range of applications. In terms of runtime, the state-of-the-art algorithm for the tree edit distance is RTED, which is guaranteed to be fast independent of the tree shape. Unfortunately, this algorithm requires up to twice the memory of its competitors. The memory is quadratic in the tree size and is a bottleneck for the tree edit distance computation.

In this paper we present a new, memory-efficient algorithm for the tree edit distance, AP-TED (All Path Tree Edit Distance). Our algorithm runs at least as fast as RTED without sacrificing memory efficiency. This is achieved by releasing memory early during the first step of the algorithm, which computes a decomposition strategy for the actual distance computation. We show the correctness of our approach and prove an upper bound for the memory usage. The strategy computed by AP-TED is optimal in the class of all-path strategies, which subsumes the class of LRH strategies used in RTED. We further present the AP-TED+ algorithm, which requires less computational effort for very small subtrees and improves the runtime of the distance computation. Our experimental evaluation confirms the low memory requirements and the runtime efficiency of our approach.

Introduction

Data with hierarchical dependencies are often modelled as trees. Tree data appear in many applications, ranging from hierarchical data formats like JSON or XML to merger trees in astrophysics [33]. An interesting query computes the similarity between two trees. The standard measure for tree similarity is the tree edit distance, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The tree edit distance has been successfully applied in bioinformatics (e.g., to find similarities between RNA secondary structures [1], [29], neuronal cells [21], or glycan structures [3]), in image analysis [7], pattern recognition [25], melody recognition [19], natural language processing [28], information extraction [12], [23], and document retrieval [22], and has received considerable attention from the database community [5], [8], [9], [10], [11], [16], [17], [18], [26], [27].

The fastest algorithms for the tree edit distance (TED) decompose the input trees into smaller subtrees and use dynamic programming to build the overall solution from the subtree solutions. The key difference between various TED algorithms is the decomposition strategy, which has a major impact on the runtime. Early attempts to compute TED [13], [24], [37] use a hard-coded strategy, which disregards or only partially considers the shape of the input trees. This may lead to very poor strategies and asymptotic runtime differences of up to a polynomial degree. The most recent development is the Robust Tree Edit Distance (RTED) algorithm [30], which operates in two steps (cf. Fig. 1(a)). In the first step, a decomposition strategy is computed. The strategy adapts to the input trees and is shown to be optimal among all previously proposed strategies. The actual distance computation is done in the second step, which executes the strategy.
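The decompose-and-memoize idea described above can be illustrated with the textbook forest recursion for the tree edit distance. The following is a minimal sketch under stated assumptions: unit costs for delete, insert, and rename, trees written as nested tuples `(label, children)`, and a fixed decomposition that always removes the rightmost root. It is neither RTED nor AP-TED, and the name `tree_edit_distance` is hypothetical.

```python
from functools import lru_cache

# Assumption: a tree is a nested tuple (label, children), where children is a
# tuple of trees; a forest is a tuple of trees. Unit edit costs throughout.
def tree_edit_distance(t1, t2):
    """Unit-cost tree edit distance via the standard forest recursion."""

    @lru_cache(maxsize=None)  # memoize results for pairs of subforests
    def dist(f, g):
        if not f and not g:
            return 0
        if not g:  # only deletions remain
            _, kids = f[-1]
            return dist(f[:-1] + kids, g) + 1
        if not f:  # only insertions remain
            _, kids = g[-1]
            return dist(f, g[:-1] + kids) + 1
        (lv, kv), (lw, kw) = f[-1], g[-1]
        return min(
            dist(f[:-1] + kv, g) + 1,  # delete rightmost root of f
            dist(f, g[:-1] + kw) + 1,  # insert rightmost root of g
            # map the two rightmost roots onto each other (rename if needed)
            dist(kv, kw) + dist(f[:-1], g[:-1]) + int(lv != lw),
        )

    return dist((t1,), (t2,))


if __name__ == "__main__":
    a = ("a", (("b", ()), ("c", ())))
    b = ("a", (("b", ()),))
    print(tree_edit_distance(a, b))
```

Always removing the rightmost root is one hard-coded decomposition. The strategies discussed in the text differ in which root-leaf path drives the decomposition of each subtree pair, which changes how many subproblems have to be computed.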

In terms of runtime, the overhead for the strategy computation in RTED is small compared to the gain due to the better strategy. Unfortunately, this does not hold for the main memory consumption. Fig. 1(b) shows the memory usage for two example trees (perfect binary trees) of 8191 nodes: the strategy computation requires 1.1 GB of RAM, while the execution of the strategy (i.e., the actual distance computation) requires only 0.55 GB. Thus, for large instances, the strategy computation is the bottleneck and the fallback is a hard-coded strategy. This is undesirable since the gain of a good strategy grows with the instance size. Reducing the memory requirements of the strategy computation raises the maximum tree size that can be processed. This is crucial especially for large trees like abstract syntax trees of source code repositories [15], [20] (Emacs: >10k nodes and MythTV: >50k nodes) or merger trees in astrophysics [33].

In this paper we propose the AP-TED algorithm, which solves the memory problem of the strategy computation. This is achieved by computing the strategy bottom-up using dynamic programming and releasing part of the memoization tables early. We prove that our algorithm requires at most 1/3 of the memory that is needed by RTED's strategy computation [30]. As a result, the memory cost of the strategy computation is never above the cost of the distance computation. Our extensive experimental evaluation on various tree shapes, which require very different strategies, confirms our analytic memory bound and shows that our algorithm is often much better than its theoretical upper bound. For some tree shapes, it even runs in linear space, while the RTED strategy algorithm always requires quadratic space.
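To give some intuition for why releasing table rows early can make memory depend on the tree shape, the toy model below counts how many per-node rows are alive at once during a bottom-up (postorder) pass. It assumes, purely for illustration, that a node's row is needed only until its parent's row has been computed. This is a simplification and not the bookkeeping used by AP-TED; the function name `peak_live_rows` and the nested-tuple tree encoding are hypothetical.

```python
# Assumption: a tree is a nested tuple (label, children).
# Toy model: each node owns one table row; a row can be released as soon as
# the parent's row has been computed from it.
def peak_live_rows(tree):
    """Return the maximum number of rows alive at once in a postorder pass."""
    live = 0
    peak = 0

    def visit(node):
        nonlocal live, peak
        _, children = node
        for child in children:
            visit(child)
        live += 1                # allocate the row of the current node
        peak = max(peak, live)
        live -= len(children)    # the children's rows are no longer needed

    visit(tree)
    return peak


if __name__ == "__main__":
    path = ("a", (("b", (("c", (("d", ()),)),)),))
    star = ("r", (("x", ()), ("y", ()), ("z", ())))
    print(peak_live_rows(path), peak_live_rows(star))
```

In this model a path-shaped tree keeps only two rows alive regardless of its length, while a star-shaped tree keeps one row per node, so the peak depends strongly on the shape of the input.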

In addition to reducing the memory usage, AP-TED computes the optimum in a larger class of strategies than RTED. Strategies are expressed by root-leaf paths that guide the decomposition of the input trees. A path decomposes a tree into subtrees by deleting nodes and edges on a root-leaf path. Each resulting subtree is recursively decomposed by a new root-leaf path. RTED computes the optimal LRH strategy. An LRH strategy considers only left, right, and heavy paths. The left (right) root-leaf path connects each parent with its first (last) child; the heavy path connects the parent with the rightmost child that roots the largest subtree. AP-TED considers all root-leaf paths and is not limited to left, right, and heavy paths. Thus, our strategy is at least as good as the strategies used by RTED. To the best of our knowledge, this is the first algorithm to compute the optimal all-path strategy. The runtime complexity of our strategy algorithm is O(n²), the same as for the RTED strategy. This result is surprising since in each recursive step we need to consider a linear number of paths compared to only three paths (left, right, and heavy) in the RTED strategy. Our empirical evaluation suggests that in practice our strategy algorithm is even slightly faster than the RTED strategy algorithm since it allocates less memory.
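The path types named above can be made concrete with a short sketch. It follows the definitions in the text: the left and right paths take the first and last child, and the heavy path takes the rightmost child among those rooting the largest subtree. The nested-tuple tree encoding and all function names are assumptions made for illustration, and the subtree sizes are recomputed naively for brevity.

```python
# Assumption: a tree is a nested tuple (label, children).
def subtree_size(tree):
    _, children = tree
    return 1 + sum(subtree_size(c) for c in children)

def left_child(children):
    return children[0]

def right_child(children):
    return children[-1]

def heavy_child(children):
    # Rightmost child among those that root the largest subtree.
    best = children[0]
    for child in children[1:]:
        if subtree_size(child) >= subtree_size(best):
            best = child
    return best

def root_leaf_path(tree, choose):
    """Follow `choose` from the root down to a leaf; return the labels."""
    path = []
    node = tree
    while True:
        label, children = node
        path.append(label)
        if not children:
            return path
        node = choose(children)

def all_root_leaf_paths(tree):
    """Enumerate every root-leaf path (one per leaf)."""
    label, children = tree
    if not children:
        return [[label]]
    return [[label] + p for c in children for p in all_root_leaf_paths(c)]


if __name__ == "__main__":
    t = ("a", (("b", (("d", ()), ("e", ()), ("f", ()))), ("c", (("g", ()),))))
    print(root_leaf_path(t, left_child))
    print(root_leaf_path(t, right_child))
    print(root_leaf_path(t, heavy_child))
    print(len(all_root_leaf_paths(t)))
```

A tree has one root-leaf path per leaf, which is why an all-path strategy has a linear number of candidates at each step, compared with the three candidates of an LRH strategy.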

On the distance computation side, we observe that a large number of subproblems that result from the tree decompositions are very small trees with one or two nodes only. We show that a significant boost can be achieved by treating these cases separately. We introduce the AP-TED+ algorithm, which leverages that fact and achieves runtime improvements of more than 50% in some cases.
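As an illustration of why tiny subtrees admit a shortcut, consider the case where one tree is a single node. Under unit costs (an assumption of this sketch), that node can be mapped to at most one node of the other tree, and every remaining node must be inserted, so the distance follows from one linear scan. The function below is a hypothetical sketch of this observation and not the single-path function defined in the paper.

```python
# Assumption: a tree is a nested tuple (label, children); unit edit costs.
def ted_single_node(label, tree):
    """Distance between a one-node tree with `label` and an arbitrary tree."""
    size = 0
    label_found = False
    stack = [tree]               # iterative traversal, linear time
    while stack:
        node_label, children = stack.pop()
        size += 1
        if node_label == label:
            label_found = True
        stack.extend(children)
    # Map the single node to one node of `tree` (free if the labels match,
    # one rename otherwise) and insert the remaining size - 1 nodes.
    return (size - 1) + (0 if label_found else 1)


if __name__ == "__main__":
    t = ("a", (("b", ()), ("c", ())))
    print(ted_single_node("c", t), ted_single_node("z", t))
```

The scan needs no quadratic table, which matches the intuition that handling such pairs separately avoids the overhead of the general single-path functions.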

Summarizing, the contributions of this paper are the following:

  • Memory efficiency. We substantially reduce the memory requirements w.r.t. previous strategy computation algorithms by traversing the trees bottom-up and systematically releasing memory early. The resulting AP-TED algorithm always consumes less memory for the strategy computation than for the actual distance computation and thus breaks the bottleneck of previous algorithms. (We show the correctness of our approach and prove an upper bound for the memory usage.)

  • Optimal all-path strategy. The decomposition strategy used by AP-TED is optimal in the class of all-path strategies. This class generalizes LRH strategies and contains all strategies of previous TED algorithms. Although our strategy algorithm must consider more paths, it is as efficient as the strategy algorithm in RTED (quadratic in the input size).

  • New single-path functions. We develop AP-TED+, which leverages two new single-path functions to compute the distance of subtree pairs when one of the subtrees is small. This case occurs frequently during the decomposition process. Our new single-path functions run in linear time and at most linear space, which substantially improves over the single-path functions ΔL, ΔR, and ΔI used in RTED [30]. To take full advantage of the new functions, we integrate them into the strategy computation to obtain better strategies. Our experiments confirm the significant runtime improvement.

The paper is structured as follows. Section 2 sets the stage for our discussion of strategy algorithms. In Section 3 we define the problem, and we present our AP-TED algorithm in Section 4. The memory-efficient implementation of the strategy computation in AP-TED is discussed in Section 5. The AP-TED+ algorithm is presented in Section 6. We treat related work in Section 7, experimentally evaluate our solution in Section 8, and conclude in Section 9.

Section snippets

Notation

We follow the notation of [30] when possible. A tree F is a directed, acyclic, connected graph with nodes N(F) and edges E(F) ⊆ N(F) × N(F), where each node has at most one incoming edge. Each node has a label, which is not necessarily unique within the tree. The nodes of a tree F are strictly and totally ordered such that (a) v > w for any edge (v,w) ∈ E(F), and (b) for any two nodes f, g, if f < g and f is not a descendant of g, then f < g′ for all descendants g′ of g. The tree traversal that visits all

Problem definition

As outlined in previous sections, the path strategies introduced by Pawlik and Augsten [30] generalize all state-of-the-art algorithms for computing the tree edit distance. They consider the class of LRH strategies and show optimality. However, LRH strategies limit the paths to be left, right, or heavy. We observe that allowing all paths leads to less expensive strategies. Another drawback of the RTED algorithm is the fact that the computation of the optimal strategy requires more space than

AP-TED algorithm

Until now, only LRH strategies have been considered in the literature [13], [24], [30], [37]. They are limited to left, right, and heavy paths only. LRH strategies are only a fraction of all possible path strategies. There may exist non-LRH path strategies that lead to better solutions. In principle, all possible path strategies must be checked for the best result. In this section we present AP-TED, a new algorithm that computes the tree edit distance with the optimal all-path strategy. The core of

Memory efficiency in AP-TED

The main memory requirement is a bottleneck of the tree edit distance computation. The strategy computation in RTED exceeds the memory needed for executing the strategy. Our AP-TED strategy algorithm reduces the memory usage by at least 2/3 and never uses more memory than the execution of the strategy. We achieve that by decreasing the maximum size of the data structures used for strategy computation.

AP-TED+ algorithm

The RTED algorithm computes the tree edit distance by executing the single-path functions for the subtree pairs resulting from the strategy. We observe that when one of the input trees in a single-path function is small, the distance can be computed more efficiently than with the existing single-path functions. We address two special cases, which are very frequent and have a high impact on the runtime: one- and two-node trees. We present AP-TED+, a new algorithm that improves over previous

Related work

Tree edit distance algorithms. The tree edit distance has a recursive solution, which decomposes the input trees into smaller subtrees and subforests. The best known algorithms are dynamic programming implementations of this recursive solution, where small subproblems are computed first. The first tree edit distance algorithm was proposed by Tai [34]. It runs in O(n⁶) time and space where n is the number of tree nodes. The runtime complexity is given by the number of subproblems that must be

Experiments

In this section we experimentally evaluate AP-TED and AP-TED+ and compare them to RTED [30]. Our empirical evaluation on real-world and synthetic data confirms our analytical results: computing the strategy in AP-TED is as efficient as in RTED, but requires significantly less memory. In particular, the strategy computation requires less memory than the actual tree edit distance computation.

Set-up. All algorithms are implemented as single-thread applications in Java 1.7. We run the experiments

Conclusion

In this paper we develop two new algorithms for the tree edit distance: AP-TED and AP-TED+. The strategy computation is a main memory bottleneck of the state-of-the-art solution, RTED [30]. The memory required for the strategy computation can be twice the memory needed for the actual tree edit distance computation. Our AP-TED strategy algorithm reduces the memory by at least 2/3 compared to the strategy computation in RTED and never uses more memory than the distance computation. The

Acknowledgements

This work is partially supported by the SyRA project of the Free University of Bozen-Bolzano, Italy.

References (37)

  • T. Dalamagas et al., A methodology for clustering xml documents by structure, Inf. Syst. (2006)
  • S. Dulucq et al., Decomposition algorithms for the tree edit distance problem, J. Discret. Algorithms (2005)
  • B. Ma et al., Computing similarity between RNA structures, Theor. Comput. Sci. (2002)
  • T. Akutsu, Tree edit distance problems algorithms and applications to bioinformatics, IEICE Trans. Inf. Syst. E (2010)
  • T. Akutsu et al., Approximating tree edit distance through string edit distance, Algorithmica (2010)
  • K.F. Aoki et al., Efficient tree-matching methods for accurate carbohydrate database queries, Genome Inform. (2003)
  • T. Aratsu et al., Approximating tree edit distance through string edit distance for binary tree codes, Fundam. Inform. (2010)
  • N. Augsten et al., Efficient top-k approximate subtree matching in small memory, IEEE Trans. Knowl. Data Eng. (TKDE) (2011)
  • N. Augsten et al., The pq-gram distance between ordered labeled trees, ACM Trans. Database Syst. (TODS) (2010)
  • J. Bellando, R. Kothari, Region-based modeling and tree edit distance as a basis for gesture recognition, in:...
  • S.S. Chawathe, Comparing hierarchical data in external memory, in: International Conference on Very Large Data Bases...
  • G. Cobéna, S. Abiteboul, A. Marian, Detecting changes in xml documents, in: International Conference on Data...
  • S. Cohen, Indexing for subtree similarity-search using edit distance, in: ACM SIGMOD International Conference on...
  • D. de Castro Reis, P.B. Golgher, A.S. da Silva, A.H.F. Laender, Automatic web news extraction using tree edit distance,...
  • E.D. Demaine et al., An optimal decomposition algorithm for tree edit distance, ACM Trans. Algorithms (2009)
  • J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, M. Montperrus, Fine-grained and accurate source code differencing,...
  • J.P. Finis, M. Raiber, N. Augsten, R. Brunel, A. Kemper, F. Färber, Rws-diff: flexible and efficient change detection...
  • M. Garofalakis et al., Xml stream processing using tree-edit distance embeddings, ACM Trans. Database Syst. (TODS) (2005)