Towards the efficient parallelization of multi-pass adaptive blocking for entity matching

https://doi.org/10.1016/j.jpdc.2016.11.002

Highlights

  • A new approach for parallelizing the Entity Matching task is proposed.

  • The idea relies on performing a MapReduce-based multi-pass adaptive blocking strategy.

  • The proposed approach shows significantly superior performance and efficiency.

Abstract

Modern parallel computing programming models, such as MapReduce (MR), have proven to be powerful tools for the efficient parallel execution of data-intensive tasks such as Entity Matching (EM) in the era of Big Data. For this reason, studies of the challenges and possible solutions involved in making EM benefit from this well-known cloud computing programming model have become an important demand nowadays. Furthermore, the effectiveness and scalability of MR-based implementations for EM depend on how well the workload distribution is balanced among all reduce tasks. In this article, we investigate how MapReduce can be used to perform efficient (load-balanced) parallel EM using a variation of the multi-pass Sorted Neighborhood Method (SNM) that uses a varying-size (adaptive) window. We propose the Multi-pass MapReduce Duplicate Count Strategy (MultiMR-DCS++), an MR-based approach for multi-pass adaptive SNM, aiming to further increase the performance of SNM. Evaluation results based on real-world datasets and a real cluster infrastructure show that our approach improves on MapReduce-based SNM in terms of both EM execution time and detection quality.

Introduction

Cloud computing has become an important resource for efficiently processing data and computationally intensive application tasks in the era of Big Data  [11]. Extensive powerful distributed hardware and service infrastructures capable of processing millions of these tasks are available around the world. Aiming to make efficient use of such cluster environments, many programming models have been created to deal with vast amounts of data. In this context, MapReduce (MR)  [6], a well-known programming model for parallel processing on cluster infrastructures, has given the data management community a powerful “chainsaw” to tackle Big Data problems. Its simplicity, flexibility, fault tolerance and capability as a scalable, shared-nothing parallel data-processing model make MR an excellent resource for the efficient workload distribution of data-intensive tasks.

In this article, we investigate the MR-based parallelization of data processing for the complex problem of Entity Matching (EM) (also known as entity resolution, deduplication, record linkage, or reference reconciliation), i.e., the task of identifying entities referring to the same real-world object  [14]. Given the pairwise-comparison nature of the problem, EM is a data-intensive and performance-critical task that demands studies on how it can benefit from cloud computing. The task is a fundamental problem in every information integration and data cleansing application  [11], e.g., to find duplicate customers or to match product descriptions in enterprise datasets. It is also essential for other types of applications, such as web page deduplication  [3], plagiarism detection  [5] and click fraud detection  [20].

Detecting such similar pairs is challenging nowadays. Besides the need to apply matching techniques to the Cartesian product of all input entities, which leads to a computational cost on the order of O(n²), there is an increasing trend of applications being expected to deal with vast amounts of data that usually do not fit in the main memory of one machine. This means that such an approach is impractical for Big Data datasets. One way to minimize the workload caused by the Cartesian product execution while maintaining the match quality is to reduce the search space by applying blocking techniques  [11]. Such techniques work by partitioning the input data into blocks of similar entities and restricting the EM process to entities that belong to the same block. For instance, it is sufficient to compare entities of the same manufacturer when matching product offers.
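To make the idea concrete, the following minimal sketch shows how blocking restricts pairwise comparisons to entities sharing the same key. The `blocking_key` function and the pairwise `matcher` predicate are illustrative placeholders, not part of the original article:

```python
from collections import defaultdict
from itertools import combinations

def block_and_match(entities, blocking_key, matcher):
    """Partition entities into blocks by key; compare only within blocks."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[blocking_key(e)].append(e)   # e.g., key = manufacturer name
    matches = []
    for block in blocks.values():
        # Pairwise comparisons are restricted to each block, avoiding
        # the O(n^2) Cartesian product over the whole input.
        for a, b in combinations(block, 2):
            if matcher(a, b):
                matches.append((a, b))
    return matches
```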

The Sorted Neighborhood Method (SNM)  [8] is one of the most popular blocking approaches. It sorts all entities using an appropriate blocking key, e.g., the first three letters of the entity name, and only compares entities within a predefined (and fixed) distance window w. SNM thus reduces the execution complexity to O(nw) for the actual matching. Fig. 1 shows an execution example of SNM for a window size w = 3. The input set Products Source consists of n = 9 entities (from Camera Samsung DV150 to IPod Nano 16G Original), represented in column L by the letters A to I. All the entities are sorted according to their blocking key K (Cam, Iph, or IPo), which in turn is composed of the first three letters of the product name. Initially, the window includes the first three entities (A, D, B) and generates three comparison pairs [(A,D), (A,B), (D,B)]. After that, the window slides down one entity to cover the entities D, B, E, and two more comparison pairs are generated [(D,E), (B,E)]. The sliding process is repeated until the window reaches the last three entities (C, G, I). In this process, the number of comparisons generated is (n − w/2)(w − 1).
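The sliding-window pass can be expressed compactly by comparing each entity with its w − 1 successors in sort order, which generates exactly the same pairs as the window walk above. A minimal sketch, again with illustrative `blocking_key` and `matcher` placeholders:

```python
def sorted_neighborhood(entities, blocking_key, matcher, w=3):
    """Fixed-window Sorted Neighborhood Method (SNM)."""
    ordered = sorted(entities, key=blocking_key)
    matches = []
    for i, a in enumerate(ordered):
        # Comparing each entity with its w-1 successors yields the same
        # pairs as sliding a window of size w one position at a time.
        for b in ordered[i + 1 : i + w]:
            if matcher(a, b):
                matches.append((a, b))
    return matches
```

For the example of Fig. 1 (n = 9, w = 3), this generates (9 − 3/2)(3 − 1) = 15 comparisons, as opposed to 36 for the full Cartesian product.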

However, SNM has a critical performance disadvantage due to its fixed and difficult-to-configure window size: if the window is too small, some duplicates may be missed (e.g., in Fig. 1, the similar pair (B,H) is not detected). On the other hand, a too-large window leads to unnecessary comparisons (e.g., in Fig. 1, the computation of the pair (A,B) is unnecessary). Note that if effectiveness is more relevant than execution time, the ideal window size equals the size of the largest duplicate sequence in the dataset. Thus, it is common to require the intervention of a data specialist to resolve this tradeoff (small vs. large window size). To overcome this disadvantage, the authors of  [7] proposed an efficient (non-parallel) SNM variation denoted Duplicate Count Strategy (DCS), which increases the window size in regions of high similarity and decreases it in regions of low similarity. They also proved that their improved variant, known as DCS++, outperforms traditional SNM by obtaining at least the same matching results with a significant reduction in the number of entity comparisons.
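The core duplicate-count heuristic can be sketched as follows. This is a simplified illustration of basic DCS only; it omits the window-skipping rule and transitive-closure handling that distinguish DCS++ in  [7]. The default threshold φ = 1/(w − 1) follows the analysis in that paper:

```python
def dcs_window(ordered, matcher, w=3, phi=None):
    """Duplicate Count Strategy (simplified sketch): grow the window while
    the ratio of detected duplicates to comparisons stays at or above phi."""
    phi = phi if phi is not None else 1.0 / (w - 1)
    matches = []
    n = len(ordered)
    for i in range(n - 1):
        duplicates = comparisons = 0
        j, limit = i + 1, i + w          # window = ordered[i : limit]
        while j < min(limit, n):
            comparisons += 1
            if matcher(ordered[i], ordered[j]):
                duplicates += 1
                matches.append((ordered[i], ordered[j]))
            j += 1
            # Extend the window one slot at a time in high-similarity regions.
            if j == limit and duplicates / comparisons >= phi:
                limit += 1
    return matches
```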

Besides the difficulty of configuring the window size, there is the challenge of choosing an effective blocking key, which arises especially when dealing with dirty (inaccurate, incomplete or erroneous) input data. In this case, it may not be sufficient to use a single blocking key to find all duplicates. Even with an adaptive window size, if the blocking key is not ideal, it is quite common that the similarity value between distant entities remains considerable. To overcome this problem, SNM also has a multi-pass variant  [8], in which multiple blocking keys are generated (e.g., using multiple entity attributes or combinations of them). For each generated blocking key, a new windowing pass is performed over the set of entities sorted according to that key. The multi-pass variant is an important resource for cases in which the effectiveness of similarity detection is essential.
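Reusing the `sorted_neighborhood` sketch from above, a multi-pass run is simply the union of one pass per blocking key; the keys named in the comment are illustrative only:

```python
def multi_pass_snm(entities, blocking_keys, matcher, w=3):
    """Multi-pass SNM: one sorted-neighborhood pass per blocking key."""
    matches = set()
    # e.g., blocking_keys = [name_prefix, manufacturer, zip_code]
    for key in blocking_keys:
        # Union of per-pass results; assumes entity pairs are hashable.
        matches.update(sorted_neighborhood(entities, key, matcher, w))
    return matches
```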

Even with the significant advances in the SNM design, EM remains a performance-critical task when applied to large datasets. Thus, this work proposes an MR-based approach that combines the efficiency gain achieved by the DCS++ method with the benefits of efficient parallelization of data-intensive tasks on cluster infrastructures. Thereby, we can further decrease the execution time of EM tasks performed with multi-pass SNM. In this sense, we make the following contributions:

  • We propose the Multi-pass MapReduce-based Duplicate Count Strategy (MultiMR-DCS++), an MR-based approach that provides an efficient parallelization of the multi-pass DCS++ method  [7] by using multiple MR jobs and applying a tailored data replication during data redistribution to allow the resizing of the adaptive window. The approach also addresses the data skewness problem with an automatic data partitioning strategy that is combined with MultiMR-DCS++ to provide satisfactory load balancing across the available nodes.

  • We evaluate MultiMR-DCS++ (adaptive window) against RepSN  [12] (fixed window), the state-of-the-art MR-based multi-pass SNM approach, and show that our approach provides better performance by reducing the overall EM execution time. The evaluation is performed on a real cluster environment and uses real-world data. (The boundary-replication idea underlying such MR-based SNM parallelizations is sketched after this list.)
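For orientation, the sketch below simulates, in plain Python without Hadoop, the boundary-replication idea that underlies MR-based SNM parallelizations such as RepSN  [12]: the sorted input is split into p ranges, and the last w − 1 entities of each range are replicated into the next one, so that every window spanning a partition boundary is evaluated by a single reduce task. The adaptive-window replication of MultiMR-DCS++ is more involved; this only illustrates the fixed-window baseline:

```python
def partition_with_replication(ordered, p, w):
    """Split the sorted entities into p ranges and replicate the last w-1
    entities of each range into the next one, so that every window that
    spans a partition boundary is fully contained in some partition."""
    size = -(-len(ordered) // p)  # ceiling division
    parts = [ordered[i * size:(i + 1) * size] for i in range(p)]
    for i in range(1, p):
        parts[i] = parts[i - 1][-(w - 1):] + parts[i]  # replicated prefix
    return parts
```

Each reduce task then runs the windowing locally on its partition; pairs formed exclusively among replicated entities must be suppressed, since the preceding partition has already generated them.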

In our previous work  [19], we proposed an approach for the single-pass MapReduce-based Duplicate Count Strategy. That solution provides an efficient parallelization of the DCS++ method  [7] by using multiple MR jobs. However, the solution presented in  [19] does not support performing a multi-pass DCS++ variant within the same set of MR jobs; to perform multi-pass DCS++, the MR process must be repeated for each pass. In this new study, we present a model that is more sophisticated in terms of robustness and extensibility, addressing both the single- and multi-pass MR-DCS++ without the need to serialize the MR process.

This article is structured as follows. Section  2 introduces the EM performance problem and explains how this problem can be treated with the MapReduce programming paradigm. Section  3 discusses related work. Section  4 describes our load balanced multi-pass adaptive windowing approach for EM using MapReduce. Section  5 presents the performed experiments and evaluation. Finally, Section  6 concludes the article and provides suggestions for future work.

Section snippets

Background

In this work, we consider the problem of EM within one data source. The input data source S contains a finite set of entities e. The task is to identify all pairs of entities M = {(eᵢ, eₖ) | eᵢ, eₖ ∈ S} that are regarded as similar.

Furthermore, we focus on the following EM challenge: minimize the execution time necessary to identify all the similar entities in a given dataset. From this perspective, our optimization goal is: for a given similarity identification operator (matcher) and its inputs, we want to…

Related work

Entity Matching (EM) is a widely studied research topic [3], [21]. Many EM approaches have been proposed and evaluated, as described in the recent surveys  [4], [14]. As modern databases become larger, deduplicating or matching them requires increasingly massive amounts of computing power and storage resources. Researchers have begun to investigate how modern parallel and distributed computing environments can be employed to reduce the time required to conduct large-scale entity matching…

General Multi-pass MR-based DCS++ workflow

Occasionally, especially when dealing with dirty (inaccurate, incomplete or erroneous) input data, it is not sufficient to use a single blocking key to generate satisfactory EM results. Even with an adaptive window size, if the blocking key is not ideal, it is quite common that the similarity between distant entities remains considerable. Multi-pass SNM addresses this problem by employing multiple blocking keys (e.g., using multiple entity attributes) and match passes in order to combine the…

Evaluation

In the following, we evaluate the single- and multi-pass MR-DCS++ against the single- and multi-pass RepSN approaches regarding three critical performance factors: the degree of skewness (Section  5.1), the number of nodes (n) available in the cluster environment (Section  5.2), and the tradeoff between matching quality and execution time (Section  5.3). In each…

Summary and outlook

We proposed a novel MultiMR-based approach, multi-pass MR-DCS++, for solving the problem of adaptive SNM parallelization. The solution provides an efficient parallelization of the DCS++ method  [7] by using multiple MR jobs and applying a tailored data replication during data redistribution to allow the resizing of the adaptive window. The approach also addresses the data skewness problem with an automatic data partitioning mechanism that can be combined with MR-DCS++ to ensure a…

Acknowledgments

The results of this work have been partially funded by EUBra-BIGSEA (690116), a Research & Innovation Action (RIA) funded by the European Commission under the Cooperation Programme, Horizon 2020 and the Ministério de Ciência, Tecnologia e Inovação (MCTI), RNP/Brazil (grant GA-0000000650/04).

References (27)

  • H. Kopcke et al., Frameworks for entity matching: A comparison, Data Knowl. Eng. (2010)
  • Apache Hadoop. URL: http://hadoop.apache.org/ (accessed:...
  • O. Benjelloun, H. Garcia-Molina, H. Gong, H. Kawai, T.E. Larson, D. Menestrina, S. Thavisomboon, D-Swoosh: A family of...
  • P. Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection (2012)
  • P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE Trans. Knowl. Data Eng. (2012)
  • G. Cosma et al., An approach to source-code plagiarism detection and investigation using latent semantic analysis, IEEE Trans. Comput. (2012)
  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
  • U. Draisbach et al., Adaptive windows for duplicate detection
  • M.A. Hernández et al., The merge/purge problem for large databases, SIGMOD Rec. (1995)
  • S.-C. Hsueh et al., A load-balanced MapReduce algorithm for blocking-based entity-resolution with multiple keys, Parallel Distrib. Comput. (2014)
  • T. Kirsten, L. Kolb, M. Hartung, A. Gross, H. Kopcke, E. Rahm, Data partitioning for parallel entity matching, in: 8th...
  • L. Kolb et al., Load balancing for MapReduce-based entity resolution
  • L. Kolb et al., Multi-pass sorted neighborhood blocking with MapReduce, Comput. Sci. (2012)

Demetrio Gomes Mestre received the bachelor’s and master’s degrees in computer science from Federal University of Campina Grande, Brazil. He is currently a Ph.D. student at Federal University of Campina Grande. His research interests include data quality, Big Data and cloud computing.

Carlos Eduardo Santos Pires received the bachelor’s and master’s degrees in computer science from Federal University of Campina Grande, Brazil. He also finished his Ph.D. from Federal University of Pernambuco, Brazil. He is currently a full Professor at Federal University of Campina Grande. His research interests include data quality and Big Data.

Dimas Cassimiro Nascimento received the bachelor’s and master’s degrees in computer science from Federal University of Campina Grande, Brazil. He is currently a Ph.D. student at Federal University of Campina Grande. His research interests include data quality, machine learning and cloud computing.
