Big Data Research

Volume 6, December 2016, Pages 43-63

Boosting the Efficiency of Large-Scale Entity Resolution with Enhanced Meta-Blocking

https://doi.org/10.1016/j.bdr.2016.08.002

Abstract

Entity Resolution constitutes a quadratic task that typically scales to large entity collections through blocking. The resulting blocks can be restructured by Meta-blocking to raise precision at a limited cost in recall. At the core of this procedure lies the blocking graph, where the nodes correspond to entities and the edges connect the comparable pairs. There are several configurations for Meta-blocking, but no hints on best practices. In general, the node-centric approaches are more robust and suitable for a series of applications, but suffer from low precision, due to the large number of unnecessary comparisons they retain.

In this work, we present three novel methods for node-centric Meta-blocking that significantly improve precision. We also introduce a pre-processing method that restricts the size of the blocking graph by removing a large number of noisy edges. As a result, it reduces the overhead time of Meta-blocking by 2 to 5 times, while increasing precision by up to an order of magnitude for a minor cost in recall. The same technique can be applied as graph-free Meta-blocking, enabling for the first time Entity Resolution over very large datasets even on commodity hardware. We evaluate our approaches through an extensive experimental study over 19 voluminous, established datasets. The outcomes indicate best practices for the configuration of Meta-blocking and verify that our techniques reduce the resolution time of state-of-the-art methods by up to an order of magnitude.

Introduction

A common task in the context of Web Data is Entity Resolution (ER), i.e., the identification of different entity profiles that pertain to the same real-world object. Exhaustive solutions to this task suffer from low efficiency, due to their inherently quadratic complexity: every entity profile has to be compared with all others. This problem is accentuated by the ever-increasing size of the datasets that are available on the Web. For example, the LODStats Web application recorded around a billion triples for Linked Open Data in December 2011, which had grown to more than 100 billion triples by March 2016. As a result, ER typically scales to large data collections through approximate techniques, which sacrifice recall to a controllable extent in order to enhance precision and time efficiency.

The most popular among these techniques is blocking [1], [2], [3]. It groups similar entities into clusters (called blocks) so that comparisons are executed only between the entities within each block [4], [5]. Typically, blocking methods for Big Data have to overcome high levels of noise not only in attribute values, but also in attribute names, due to the unprecedented schema heterogeneity. For instance, Google Base alone encompasses 100,000 distinct schemata that correspond to 10,000 entity types [6]. Most blocking methods deal with these high levels of noise through redundancy [1], [7]: they place every entity profile into multiple blocks so as to reduce the likelihood of missed matches.

The simplest method of this type is Token Blocking [9], [2]. It disregards schema information and semantics, creating a separate block for every token that appears in the attribute values of at least two entities. To illustrate its functionality, consider the entity profiles in Fig. 1(a), where p1 and p2 match with p3 and p4, respectively; Token Blocking clusters them in the blocks of Fig. 1(b), which place both pairs of duplicates in at least one common block at the cost of 13 comparisons, in total. The resulting computational cost is high, given that the brute-force approach executes 15 comparisons.
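
To make this concrete, the following is a minimal Token Blocking sketch in Python. It illustrates the general technique described above, not the authors' implementation; the profiles argument (a hypothetical mapping of profile ids to attribute-value dictionaries) is our own assumption.

    from collections import defaultdict

    def token_blocking(profiles):
        # profiles: hypothetical dict of profile id -> {attribute: value}.
        # Schema information is ignored: a block is created for every token
        # that appears in the attribute values of at least two entities.
        blocks = defaultdict(set)
        for pid, attributes in profiles.items():
            for value in attributes.values():
                for token in str(value).lower().split():
                    blocks[token].add(pid)
        # Blocks with a single entity yield no comparisons and are dropped.
        return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}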

This is a general trait of block collections that involve redundancy: in their effort to achieve high recall, they produce a large number of unnecessary comparisons. These come in two forms: the redundant ones repeatedly compare the same entity profiles across different blocks, while the superfluous ones compare non-matching entities. In our example, b2 and b4 contain one redundant comparison each, which are repeated in b1 and b3, respectively; all other blocks entail superfluous comparisons between non-matching entity profiles, except for the redundant comparison p3-p5 in b8 (it is repeated in b6). In total, the blocks of Fig. 1(b) involve 3 redundant and 8 superfluous out of the 13 comparisons.
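
Counting these two kinds of unnecessary comparisons is mechanical once the blocks and the ground truth are given. The sketch below is our own illustration; matches is a hypothetical set of frozensets holding the known duplicate pairs.

    from itertools import combinations

    def classify_comparisons(blocks, matches):
        # A comparison is redundant if the same pair was already executed
        # in a previously processed block; a first-time comparison between
        # non-matching profiles is superfluous.
        seen = set()
        redundant = superfluous = 0
        for ids in blocks.values():
            for pair in map(frozenset, combinations(sorted(ids), 2)):
                if pair in seen:
                    redundant += 1
                elif pair not in matches:
                    superfluous += 1
                seen.add(pair)
        return redundant, superfluous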

Current state-of-the-art. To mitigate this phenomenon, methods such as Comparison Propagation [10] and Iterative Blocking [11] aim to process an existing block collection in the optimal way (see Section 2 for more details). Among these methods, Meta-blocking achieves the best balance between precision and recall, being one of the few techniques to scale well to millions of entities [7], [8]. In essence, it restructures a block collection B into a new one B′ that contains a significantly lower number of unnecessary comparisons, while detecting almost the same number of duplicates. This procedure operates in two steps.

First, it transforms B into the blocking graph GB, which contains a node ni for every entity pi in B and an edge ei,j for every pair of co-occurring entities pi and pj (i.e., entities sharing at least one block). Fig. 2(a) depicts the graph for the blocks in Fig. 1(b). As no parallel edges are constructed, every pair of entities is compared at most once, thus eliminating all redundant comparisons.

Second, it annotates every edge with a weight analogous to the likelihood that the adjacent entities are matching, based on the blocks they have in common. For instance, the edges in Fig. 2(a) are weighted with the Jaccard similarity of the lists of blocks containing their adjacent entities. The edges with low weights correspond to superfluous comparisons and are pruned. A possible approach is to discard all edges with a weight lower than the overall mean weight (1/4, in our example). This yields the pruned graph in Fig. 2(b).
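
As a rough sketch of these two steps (our own simplification, not the paper's code), the blocking graph can be kept implicit: one edge per co-occurring pair, weighted by the Jaccard similarity of the entities' block sets, with edges below the mean weight discarded.

    from itertools import combinations
    from statistics import mean

    def edge_centric_pruning(blocks):
        # Inverted index: entity id -> the set of blocks containing it.
        entity_blocks = {}
        for key, ids in blocks.items():
            for pid in ids:
                entity_blocks.setdefault(pid, set()).add(key)
        # One edge per pair of co-occurring entities (no parallel edges),
        # weighted by the Jaccard similarity of their block sets.
        edges = {}
        for ids in blocks.values():
            for a, b in combinations(sorted(ids), 2):
                if (a, b) not in edges:
                    ba, bb = entity_blocks[a], entity_blocks[b]
                    edges[(a, b)] = len(ba & bb) / len(ba | bb)
        # Edge-centric pruning: keep the edges at or above the global mean.
        threshold = mean(edges.values())
        return {pair: w for pair, w in edges.items() if w >= threshold}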

Pruning algorithms of this type are called edge-centric, because they iterate over the edges of the blocking graph and retain the globally best ones. Higher recall is achieved by the node-centric pruning algorithms, which iterate over the nodes of the blocking graph and retain the locally best edges. These are the edges with the highest weights in each neighborhood and correspond to the most likely matches for each entity. In contrast, the edge-centric algorithms do not guarantee to include every entity in the restructured blocks. Their recall is lower than that of the node-centric algorithms by 20%, on average, when compared under the same settings [7].

To illustrate the functionality of node-centric approaches, consider the pruned blocking graph in Fig. 3(a); for each node in Fig. 2(a), it has retained the adjacent edges that exceed the average weight of the neighborhood. Regardless of the type of the pruning algorithm, the restructured block collection B′ is formed by creating a new block for every retained edge – as depicted in Figs. 2(c) and 3(b). In both cases, B′ maintains the original recall, while reducing the number of executed comparisons to 5 and 9, respectively.
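
For contrast, a node-centric variant of the pruning step could look as follows; again a sketch under our own assumptions, reusing the weighted edges produced above and keeping, per node, the adjacent edges at or above the neighborhood's average weight (whether ties are kept is a design choice).

    def node_centric_pruning(edges):
        # edges: dict mapping an entity pair (a, b) to its weight.
        neighborhoods = {}
        for (a, b), w in edges.items():
            neighborhoods.setdefault(a, []).append(((a, b), w))
            neighborhoods.setdefault(b, []).append(((a, b), w))
        retained = set()
        for node, adjacent in neighborhoods.items():
            # Local threshold: the average weight of this neighborhood.
            threshold = sum(w for _, w in adjacent) / len(adjacent)
            retained.update(pair for pair, w in adjacent if w >= threshold)
        return retained

Every retained edge then becomes a block of its own, as in Figs. 2(c) and 3(b).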

Open issues. Despite the significant enhancements in efficiency, Meta-blocking suffers from three drawbacks:

(i) Though more robust with respect to recall, the node-centric pruning algorithms exhibit low efficiency, because they retain a considerable portion of redundant and superfluous comparisons. In most cases, their precision is lower than that of the edge-centric ones by 50% [7]. This is also illustrated in our example, where the restructured blocks of Fig. 3(b) contain 4 redundant comparisons in b2, b4, b6 and b8 and 3 superfluous ones in b5, b7 and b9; the edge-centric counterpart in Fig. 2(c) retains just 3 superfluous comparisons.

(ii) The processing of voluminous datasets involves a significant overhead. The corresponding blocking graphs comprise millions of nodes that are strongly connected through billions of edges. Inevitably, the pruning of such graphs is very time-consuming, leaving plenty of room for improving its efficiency (see Section 5.6).

(iii) Meta-blocking is difficult to configure. There are five different weighting schemes that can be combined with four pruning algorithms, thus yielding 20 pruning schemes, in total (see Section 3 for more details). As yet, there are no guidelines on how to choose the best configuration for the application at hand and the available resources.

Proposed solution. In this paper, we describe novel techniques for overcoming the weaknesses of Meta-blocking.

First, we propose three new node-centric pruning algorithms that achieve significantly higher precision than the existing ones. The most conservative approach, Redundancy Pruning, produces restructured blocks with no redundant comparisons and prunes up to 50% more comparisons. It achieves the same recall as the existing techniques, but almost doubles their precision. The other two methods exploit generic properties of the blocking graph to prune at least 50% more comparisons than the existing techniques. Graph Partitioning applies only to bipartite blocking graphs, considering exclusively one of the two partitions in its processing. Reciprocal Pruning applies to any blocking graph, retaining only the edges that are important for both adjacent entities. Their recall is slightly lower than that of the baselines, but their precision is higher by up to an order of magnitude.
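
Of the three, Reciprocal Pruning is the easiest to sketch. Assuming the directed pruned blocking graph is available as a mapping from each node to the neighbors it retained (a hypothetical representation on our part), an undirected edge survives only if it was retained from both sides.

    def reciprocal_pruning(retained_neighbors):
        # retained_neighbors: hypothetical dict mapping every node to the
        # set of neighbors whose edges survived its local, node-centric
        # pruning (i.e., the directed pruned blocking graph).
        kept = set()
        for node, neighbors in retained_neighbors.items():
            for other in neighbors:
                # Keep the edge only if it is important for both entities.
                if node in retained_neighbors.get(other, set()):
                    kept.add(frozenset((node, other)))
        return kept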

Second, we introduce Block Filtering, which removes every entity from the blocks that are the least important for it. This approach can be used in two ways: (i) In conjunction with graph-based pruning schemes, it acts as a pre-processing technique that shrinks the blocking graph by discarding more than 50% of its unnecessary edges. Thus, it enhances the scalability of all graph-based Meta-blocking techniques to a significant extent. (ii) As a stand-alone, graph-free Meta-blocking method that involves significantly lower space and time complexities (i.e., overhead). Its configuration is straightforward, it scales to voluminous datasets even with commodity hardware, and it requires up to 2 orders of magnitude less time than the state-of-the-art method.
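
A minimal, stand-alone sketch of the idea behind Block Filtering follows: rank each entity's blocks by size, treating smaller blocks as more important, and keep the entity only in the top portion of them. Both the importance proxy and the ratio value are our assumptions for illustration, not the paper's tuned configuration.

    def block_filtering(blocks, ratio=0.8):
        # Index every entity's blocks.
        entity_blocks = {}
        for key, ids in blocks.items():
            for pid in ids:
                entity_blocks.setdefault(pid, []).append(key)
        # Keep each entity only in the `ratio` share of its blocks that
        # rank as most important for it (here: the smallest ones).
        filtered = {key: set() for key in blocks}
        for pid, keys in entity_blocks.items():
            keys.sort(key=lambda k: len(blocks[k]))
            limit = max(1, round(ratio * len(keys)))
            for key in keys[:limit]:
                filtered[key].add(pid)
        # Drop blocks that no longer yield any comparisons.
        return {k: ids for k, ids in filtered.items() if len(ids) > 1}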

Finally, we address the problem of a priori selecting the best pruning scheme, depending on the application at hand and the available resources. We analytically compare the performance of all Meta-blocking methods over 12 real and 7 synthetic established benchmarks, which range from a few thousand to several million entities. Our experimental results provide insights into the effect of weighting schemes on each pruning algorithm and identify the pruning schemes that consistently exhibit the best balance between recall, precision and run-time for the main types of ER applications. Our thorough experiments also verify that our techniques outperform the best relevant methods in the literature, including the existing Meta-blocking techniques, to a significant extent.

Contributions & paper organization. In summary, we make the following contributions:

  • We present three new node-centric pruning algorithms that significantly improve the precision of the existing ones, by 30% to 800%, at a small cost in recall.

  • We introduce a graph-free technique that minimizes the overhead of Meta-blocking by cleaning the blocking graph from most of its noisy edges. With its help, ER scales to large datasets even with limited resources, and the resolution time improves by almost an order of magnitude.

  • We experimentally verify the superior performance of our new methods through an extensive study over 19 voluminous datasets with different characteristics. Its outcomes provide insights into the best configuration for Meta-blocking, depending on the resources and the application at hand. The code and the data of our experiments are publicly available for any interested researcher.

The rest of the paper is structured as follows: in Section 2, we delve into the most relevant works in the literature, while in Section 3, we formally define the task of Meta-blocking, elaborating on its main notions. Section 4 introduces our novel techniques, and Section 5 presents our thorough experimental evaluation. We conclude the paper in Section 6 along with directions for future work.

Section snippets

Related work

Entity Resolution has been the focus of numerous works that aim to tame its quadratic complexity and scale it to large volumes of data [4], [5]. A large part of the proposed techniques are approximate, with blocking being the most popular among them [1]. Some blocking methods produce disjoint blocks (e.g., Standard Blocking [12]), but most of them yield overlapping blocks with redundant comparisons. In this way, they achieve high recall in the context of noisy data. Depending on the

Preliminaries

In this section, we first elaborate on the main notions of Entity Resolution and its evaluation metrics. Then, we delve into the functionality of Meta-blocking and introduce the weighting schemes and the pruning algorithms that have been proposed in the literature. Finally, we distinguish the applications of ER into two main categories and explain which pruning algorithms are suitable for each of them.

Entity Resolution. At the core of ER lies the notion of entity profile, p. As such, we define

Proposed approach

In this section, we present our four new methods that lead to scalable Meta-blocking. The first three constitute novel node-centric pruning algorithms that achieve higher precision than the existing ones. Their operation relies on the structure of the directed pruned blocking graph and aims to reduce the noisy, unnecessary edges that are retained for each node. They compete with one another and can only be used as alternatives, not in combination.

Graph Partitioning exploits the bipartite blocking graphs that

Evaluation

In this section, we delve into the performance characteristics of our methods through a thorough experimental study. We begin with a presentation of the datasets and the relevant metrics in Section 5.1. We then discuss the performance of the existing Meta-blocking techniques in Section 5.2. We examine the three new methods for node-centric pruning in Section 5.3. In Section 5.4, we fine-tune Block Filtering and demonstrate its beneficial effect on the input blocks, their blocking graph and the

Conclusions

In this paper, we introduced three new node-centric pruning algorithms and compared them with the existing ones through an extensive experimental study. Redundancy Pruning does not affect recall, yet it saves around 30% more comparisons. Reciprocal Pruning decreases recall to a limited extent, but discards more than 66% additional comparisons. Graph Partitioning prunes 50% more comparisons with practically no impact on recall. The last method applies only to Clean–Clean ER, whereas the other two

Acknowledgements

This work was partially supported by EU H2020 BigDataEurope (#644564) project.

References

  • B. Kenig et al., MFIBlocks: an effective blocking algorithm for entity resolution, Inf. Syst. (2013)

  • P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE Trans. Knowl. Data Eng. (2012)

  • G. Papadakis et al., A blocking framework for entity resolution in highly heterogeneous information spaces, IEEE Trans. Knowl. Data Eng. (2013)

  • P. Christen, Data Matching, Data-Centric Systems and Applications (2012)

  • A. Elmagarmid et al., Duplicate record detection: a survey, IEEE Trans. Knowl. Data Eng. (2007)

  • J. Madhavan et al., Web-scale data integration: you can afford to pay as you go

  • G. Papadakis et al., Meta-blocking: taking entity resolution to the next level, IEEE Trans. Knowl. Data Eng. (2014)

  • G. Simonini et al., BLAST: a loosely schema-aware meta-blocking approach for entity resolution, PVLDB (2016)

  • G. Papadakis et al., Efficient entity resolution for large heterogeneous information spaces

  • G. Papadakis et al., Eliminating the redundancy in blocking-based entity resolution methods

  • S.E. Whang et al., Entity resolution with iterative blocking

  • I. Fellegi et al., A theory for record linkage, J. Am. Stat. Assoc. (1969)

  • A.N. Aizawa et al., A fast linkage detection scheme for multi-source information integration