Towards the efficient parallelization of multi-pass adaptive blocking for entity matching

https://doi.org/10.1016/j.jpdc.2016.11.002

Highlights

  • A new approach for parallelizing the Entity Matching task is proposed.

  • The idea relies on performing a MapReduce-based multi-pass adaptive blocking strategy.

  • The proposed approach shows significantly superior performance and efficiency.

Abstract

Modern parallel computing programming models, such as MapReduce (MR), have proven to be powerful tools for the efficient parallel execution of data-intensive tasks such as Entity Matching (EM) in the era of Big Data. For this reason, studies of the challenges and possible solutions involved in making EM benefit from this well-known cloud computing programming model have become an important demand nowadays. Furthermore, the effectiveness and scalability of MR-based implementations for EM depend on how well the workload distribution is balanced among all reduce tasks. In this article, we investigate how MapReduce can be used to perform efficient (load-balanced) parallel EM using a variation of the multi-pass Sorted Neighborhood Method (SNM) that uses a varying-size (adaptive) window. We propose the Multi-pass MapReduce Duplicate Count Strategy (MultiMR-DCS++), an MR-based approach for multi-pass adaptive SNM, aiming to further increase the performance of SNM. Evaluation results based on real-world datasets and a real cluster infrastructure show that our approach improves on MapReduce-based SNM in terms of both EM execution time and detection quality.

Introduction

Cloud computing has become an important resource for efficiently processing data and computationally intensive application tasks in the era of Big Data  [11]. Extensive powerful distributed hardware and service infrastructures capable of processing millions of these tasks are available around the world. Aiming to make efficient use of such cluster environments, many programming models have been created to deal with vast amounts of data. In this context, MapReduce (MR)  [6], a well-known programming model for parallel processing on cluster infrastructures, has given the data management community a powerful “chainsaw” to tackle Big Data problems. Its simplicity, flexibility, fault tolerance and capability as a scalable, shared-nothing parallel data-processing model make MR an excellent resource for the efficient workload distribution of data-intensive tasks.

In this article, we investigate the MR-based parallelization of data processing for the complex problem of Entity Matching (EM) (also known as entity resolution, deduplication, record linkage, or reference reconciliation), i.e., the task of identifying entities referring to the same real-world object  [14]. Given the pairwise-comparison nature of the problem, EM is a data-intensive and performance-critical task that demands studies on how it can benefit from cloud computing. The task is a fundamental problem in every information integration and data cleansing application  [11], e.g., to find duplicate customers or to match product descriptions in enterprise datasets. It is also essential for other types of applications, such as web page deduplication  [3], plagiarism detection  [5] and click fraud detection  [20].

Detecting such similar pairs is challenging nowadays. Besides the need to apply matching techniques to the Cartesian product of all input entities, which leads to a computational cost on the order of O(n²), there is an increasing trend of applications being expected to deal with vast amounts of data that usually do not fit in the main memory of one machine. This means that such an approach is impractical for Big Data datasets. One way to minimize the workload caused by the Cartesian product execution while maintaining the match quality is to reduce the search space by applying blocking techniques  [11]. Such techniques work by partitioning the input data into blocks of similar entities and restricting the EM process to entities that belong to the same block. For instance, it is sufficient to compare entities of the same manufacturer when matching product offers.
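To make the idea concrete, the following minimal sketch shows how blocking restricts pairwise comparisons to entities sharing the same key. The `blocking_key` function and the pairwise `matcher` predicate are illustrative placeholders, not part of the original article:

```python
from collections import defaultdict
from itertools import combinations

def block_and_match(entities, blocking_key, matcher):
    """Partition entities into blocks by key; compare only within blocks."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[blocking_key(e)].append(e)   # e.g., key = manufacturer name
    matches = []
    for block in blocks.values():
        # Pairwise comparisons are restricted to each block, avoiding
        # the O(n^2) Cartesian product over the whole input.
        for a, b in combinations(block, 2):
            if matcher(a, b):
                matches.append((a, b))
    return matches
```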

The Sorted Neighborhood Method (SNM)  [8] is one of the most popular blocking approaches. It sorts all entities using an appropriate blocking key, e.g., the first three letters of the entity name, and only compares entities within a predefined (and fixed) distance window w. SNM thus reduces the execution complexity to O(nw) for the actual matching. Fig. 1 shows an execution example of SNM for a window size w = 3. The input set Products Source consists of n = 9 entities (from Camera Samsung DV150 to IPod Nano 16G Original), represented in column L by the letters A to I. All the entities are sorted according to their blocking key K (Cam, Iph, or IPo), which in turn is composed of the first three letters of the product name. Initially, the window includes the first three entities (A, D, B) and generates three comparison pairs [(A,D), (A,B), (D,B)]. After that, the window slides down one entity to cover the entities D, B, E, and two more comparison pairs are generated [(D,E), (B,E)]. The sliding process is repeated until the window reaches the last three entities (C, G, I). In this process, the number of comparisons generated is (n − w/2)(w − 1).
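The sliding-window pass can be expressed compactly by comparing each entity with its w − 1 successors in sort order, which generates exactly the same pairs as the window walk above. A minimal sketch, again with illustrative `blocking_key` and `matcher` placeholders:

```python
def sorted_neighborhood(entities, blocking_key, matcher, w=3):
    """Fixed-window Sorted Neighborhood Method (SNM)."""
    ordered = sorted(entities, key=blocking_key)
    matches = []
    for i, a in enumerate(ordered):
        # Comparing each entity with its w-1 successors yields the same
        # pairs as sliding a window of size w one position at a time.
        for b in ordered[i + 1 : i + w]:
            if matcher(a, b):
                matches.append((a, b))
    return matches
```

For the example of Fig. 1 (n = 9, w = 3), this generates (9 − 3/2)(3 − 1) = 15 comparisons, as opposed to 36 for the full Cartesian product.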

However, SNM has a critical performance disadvantage due to its fixed and difficult-to-configure window size: if the window is too small, some duplicates may be missed (e.g., in Fig. 1, the similar pair (B,H) is not detected). On the other hand, a too-large window leads to unnecessary comparisons (e.g., in Fig. 1, the computation of the pair (A,B) is unnecessary). Note that if effectiveness is more relevant than execution time, the ideal window size equals the size of the largest duplicate sequence in the dataset. Thus, it is common to require the intervention of a data specialist to resolve this tradeoff (small vs. large window size). To overcome this disadvantage, the authors of  [7] proposed an efficient (non-parallel) SNM variation denoted Duplicate Count Strategy (DCS), which increases the window size in regions of high similarity and decreases it in regions of low similarity. They also proved that their improved variant, known as DCS++, outperforms traditional SNM by obtaining at least the same matching results with a significant reduction in the number of entity comparisons.
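The core duplicate-count heuristic can be sketched as follows. This is a simplified illustration of basic DCS only; it omits the window-skipping rule and transitive-closure handling that distinguish DCS++ in  [7]. The default threshold φ = 1/(w − 1) follows the analysis in that paper:

```python
def dcs_window(ordered, matcher, w=3, phi=None):
    """Duplicate Count Strategy (simplified sketch): grow the window while
    the ratio of detected duplicates to comparisons stays at or above phi."""
    phi = phi if phi is not None else 1.0 / (w - 1)
    matches = []
    n = len(ordered)
    for i in range(n - 1):
        duplicates = comparisons = 0
        j, limit = i + 1, i + w          # window = ordered[i : limit]
        while j < min(limit, n):
            comparisons += 1
            if matcher(ordered[i], ordered[j]):
                duplicates += 1
                matches.append((ordered[i], ordered[j]))
            j += 1
            # Extend the window one slot at a time in high-similarity regions.
            if j == limit and duplicates / comparisons >= phi:
                limit += 1
    return matches
```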

Besides the difficulty of configuring the window size, there is the challenge of choosing an effective blocking key, which arises especially when dealing with dirty (inaccurate, incomplete or erroneous) input data. In this case, it may not be sufficient to use a single blocking key to find all duplicates. Even with an adaptive window size, if the blocking key is not ideal, it is quite common that the similarity value between distant entities remains considerable. To overcome this problem, SNM also has a multi-pass variant  [8], in which multiple blocking keys are generated (e.g., using multiple entity attributes or combinations of them). For each generated blocking key, a new windowing pass is performed over the set of entities sorted according to that key. The multi-pass variant is an important resource for cases in which the effectiveness of similarity detection is essential.
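Reusing the `sorted_neighborhood` sketch from above, a multi-pass run is simply the union of one pass per blocking key; the keys named in the comment are illustrative only:

```python
def multi_pass_snm(entities, blocking_keys, matcher, w=3):
    """Multi-pass SNM: one sorted-neighborhood pass per blocking key."""
    matches = set()
    # e.g., blocking_keys = [name_prefix, manufacturer, zip_code]
    for key in blocking_keys:
        # Union of per-pass results; assumes entity pairs are hashable.
        matches.update(sorted_neighborhood(entities, key, matcher, w))
    return matches
```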

Even with the significant advances in the SNM design, EM remains a performance-critical task when applied to large datasets. Thus, this work proposes an MR-based approach that combines the efficiency gain achieved by the DCS++ method with the benefits of efficient parallelization of data-intensive tasks on cluster infrastructures. Thereby, we can further decrease the execution time of EM tasks performed with multi-pass SNM. In this sense, we make the following contributions:

  • We propose the Multi-pass MapReduce-based Duplicate Count Strategy (MultiMR-DCS++), an MR-based approach that provides an efficient parallelization of the multi-pass DCS++ method  [7] by using multiple MR jobs and applying a tailored data replication during data redistribution to allow the resizing of the adaptive window. The approach also addresses the data skewness problem with an automatic data partitioning strategy that is combined with MultiMR-DCS++ to provide satisfactory load balancing across the available nodes.

  • We evaluate MultiMR-DCS++ (adaptive window) against RepSN  [12] (fixed window), the state-of-the-art MR-based multi-pass SNM approach, and show that our approach provides better performance by reducing the overall EM execution time. The evaluation is performed on a real cluster environment and uses real-world data. (The boundary-replication idea underlying such MR-based SNM parallelizations is sketched after this list.)
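For orientation, the sketch below simulates, in plain Python without Hadoop, the boundary-replication idea that underlies MR-based SNM parallelizations such as RepSN  [12]: the sorted input is split into p ranges, and the last w − 1 entities of each range are replicated into the next one, so that every window spanning a partition boundary is evaluated by a single reduce task. The adaptive-window replication of MultiMR-DCS++ is more involved; this only illustrates the fixed-window baseline:

```python
def partition_with_replication(ordered, p, w):
    """Split the sorted entities into p ranges and replicate the last w-1
    entities of each range into the next one, so that every window that
    spans a partition boundary is fully contained in some partition."""
    size = -(-len(ordered) // p)  # ceiling division
    parts = [ordered[i * size:(i + 1) * size] for i in range(p)]
    for i in range(1, p):
        parts[i] = parts[i - 1][-(w - 1):] + parts[i]  # replicated prefix
    return parts
```

Each reduce task then runs the windowing locally on its partition; pairs formed exclusively among replicated entities must be suppressed, since the preceding partition has already generated them.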

In our previous work  [19], we proposed an approach for the single-pass MapReduce-based Duplicate Count Strategy. That solution provides an efficient parallelization of the DCS++ method  [7] by using multiple MR jobs. However, the solution presented in  [19] does not support performing a multi-pass DCS++ variant within the same set of MR jobs; to perform multi-pass DCS++, the MR process must be repeated for each pass. In this new study, we present a model that is more sophisticated in terms of robustness and extensibility, addressing both the single- and multi-pass MR-DCS++ without the need to serialize the MR process.

This article is structured as follows. Section  2 introduces the EM performance problem and explains how this problem can be treated with the MapReduce programming paradigm. Section  3 discusses related work. Section  4 describes our load balanced multi-pass adaptive windowing approach for EM using MapReduce. Section  5 presents the performed experiments and evaluation. Finally, Section  6 concludes the article and provides suggestions for future work.

Section snippets

Background

In this work, we consider the problem of EM within one data source. The input data source S contains a finite set of entities e. The task is to identify all pairs of entities M = {(eᵢ, eₖ) | eᵢ, eₖ ∈ S} that are regarded as similar.

Furthermore, we focus on the following EM challenge: minimize the execution time necessary to identify all the similar entities in a given dataset. From this perspective, our optimization goal is: for a given similarity identification operator (matcher) and its inputs, we want to…

Related work

Entity Matching (EM) is a widely studied research topic [3], [21]. Many EM approaches have been proposed and evaluated, as described in the recent surveys  [4], [14]. As modern databases become larger, deduplicating or matching them requires increasingly massive amounts of computing power and storage resources. Researchers have begun to investigate how modern parallel and distributed computing environments can be employed to reduce the time required to conduct large-scale entity matching…

General Multi-pass MR-based DCS++ workflow

Occasionally, especially when dealing with dirty (inaccurate, incomplete or erroneous) input data, it is not sufficient to use a single blocking key to generate satisfactory EM results. Even with an adaptive window size, if the blocking key is not ideal, it is quite common that the similarity between distant entities remains considerable. Multi-pass SNM addresses this problem by employing multiple blocking keys (e.g., using multiple entity attributes) and match passes in order to combine the…

Evaluation

In the following, we evaluate the single- and multi-pass MR-DCS++ against the single- and multi-pass RepSN approaches regarding three critical performance factors: the degree of skewness (Section  5.1), the number of nodes (n) available in the cluster environment (Section  5.2), and the tradeoff between matching quality and execution time (Section  5.3). In each…

Summary and outlook

We proposed a novel MultiMR-based approach, multi-pass MR-DCS++, for solving the problem of adaptive SNM parallelization. The solution provides an efficient parallelization of the DCS++ method  [7] by using multiple MR jobs and applying a tailored data replication during data redistribution to allow the resizing of the adaptive window. The approach also addresses the data skewness problem with an automatic data partitioning mechanism that can be combined with MR-DCS++ to ensure a…

Acknowledgments

The results of this work have been partially funded by EUBra-BIGSEA (690116), a Research & Innovation Action (RIA) funded by the European Commission under the Cooperation Programme, Horizon 2020 and the Ministério de Ciência, Tecnologia e Inovação (MCTI), RNP/Brazil (grant GA-0000000650/04).

References (27)

  • H. Kopcke et al., Frameworks for entity matching: A comparison, Data Knowl. Eng. (2010)
  • Apache Hadoop. URL: http://hadoop.apache.org/ (accessed:...
  • O. Benjelloun, H. Garcia-Molina, H. Gong, H. Kawai, T.E. Larson, D. Menestrina, S. Thavisomboon, D-Swoosh: A family of...
  • P. Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection (2012)
  • P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE Trans. Knowl. Data Eng. (2012)
  • G. Cosma et al., An approach to source-code plagiarism detection and investigation using latent semantic analysis, IEEE Trans. Comput. (2012)
  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
  • U. Draisbach et al., Adaptive windows for duplicate detection
  • M.A. Hernández et al., The merge/purge problem for large databases, SIGMOD Rec. (1995)
  • S.-C. Hsueh et al., A load-balanced MapReduce algorithm for blocking-based entity-resolution with multiple keys, Parallel Distrib. Comput. (2014)
  • T. Kirsten, L. Kolb, M. Hartung, A. Gross, H. Kopcke, E. Rahm, Data partitioning for parallel entity matching, in: 8th...
  • L. Kolb et al., Load balancing for MapReduce-based entity resolution
  • L. Kolb et al., Multi-pass sorted neighborhood blocking with MapReduce, Comput. Sci. (2012)

Demetrio Gomes Mestre received the bachelor’s and master’s degrees in computer science from Federal University of Campina Grande, Brazil. He is currently a Ph.D. student at Federal University of Campina Grande. His research interests include data quality, Big Data and cloud computing.

Carlos Eduardo Santos Pires received the bachelor’s and master’s degrees in computer science from Federal University of Campina Grande, Brazil. He also finished his Ph.D. from Federal University of Pernambuco, Brazil. He is currently a full Professor at Federal University of Campina Grande. His research interests include data quality and Big Data.

Dimas Cassimiro Nascimento received the bachelor’s and master’s degrees in computer science from Federal University of Campina Grande, Brazil. He is currently a Ph.D. student at Federal University of Campina Grande. His research interests include data quality, machine learning and cloud computing.
