MapReduce-based entity matching with multiple blocking functions

Jin, Cheqing; Chen, Jie; Liu, Huiping

doi:10.1007/s11704-016-5346-4

Research Article
Published: 09 August 2016

Frontiers of Computer Science, Volume 11, pages 895–911 (2017)

Cheqing Jin¹,
Jie Chen¹ &
Huiping Liu¹

Abstract

Entity matching, which aims to find records that refer to the same real-world object, has been studied for decades. To avoid verifying every pair of records in a massive data set, a common approach, known as blocking, selects only a small proportion of record pairs for verification, at a cost far lower than O(n²), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical, since many more matching records can be found this way, which significantly improves the quality of the result.
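To make the blocking idea concrete, the following is a minimal single-machine sketch, not code from the paper. The toy records and the two blocking functions (`block_by_name_prefix`, `block_by_city`) are hypothetical, chosen only to show how each blocking function groups records and how running several functions independently yields more candidate pairs than any one of them alone, while still verifying fewer pairs than the full n(n-1)/2.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records (id, name, city); not taken from the paper.
RECORDS = [
    (1, "john smith", "boston"),
    (2, "jon smith", "boston"),
    (3, "john smyth", "austin"),
    (4, "mary jones", "austin"),
    (5, "marie jones", "dallas"),
]


def block_by_name_prefix(rec):
    """Assumed blocking function: first two letters of the name."""
    return "name:" + rec[1][:2]


def block_by_city(rec):
    """Assumed blocking function: the city attribute."""
    return "city:" + rec[2]


def candidate_pairs(records, blocking_functions):
    """Return the set of id pairs that share a block under any function."""
    blocks = defaultdict(list)
    for rec in records:
        for fn in blocking_functions:
            blocks[fn(rec)].append(rec[0])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are paired for verification.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    n = len(RECORDS)
    print("all pairs:", n * (n - 1) // 2)
    print("name only:", len(candidate_pairs(RECORDS, [block_by_name_prefix])))
    print("name + city:", len(candidate_pairs(
        RECORDS, [block_by_name_prefix, block_by_city])))
```

In this toy data, the pair (3, 4) is reachable only through the city-based function, which illustrates why adding blocking functions can surface matches that a single function misses. The in-memory `set` hides the fact that the same pair may be produced by several blocks; in a distributed setting that duplication has to be handled explicitly.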

The MapReduce (MR) framework is widely used to improve the performance and scalability of complex queries by running many map and reduce tasks in parallel. However, entity matching on MapReduce is non-trivial because of two unavoidable challenges: load balancing and pair deduplication. Existing work can handle load balancing or pair deduplication individually, but cannot handle both at the same time. In this paper, we propose a novel solution, called MrEm, that addresses both challenges while supporting multiple blocking functions. Theoretical analysis and experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed solution.
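The two challenges named above can be seen in a small simulation. The sketch below is not MrEm, whose details are not given in this abstract; it mimics a map, shuffle, and reduce pass in plain Python over hypothetical data (`KEYS`). For deduplication it uses a generic smallest-common-key rule as a stand-in: each record carries all of its blocking keys, and a reducer emits a pair only when its own key is the smallest key the two records share. The helper `comparisons_per_block` shows how uneven block sizes translate into uneven reducer work.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: record id -> blocking keys from several blocking functions.
KEYS = {
    1: {"name:jo", "city:boston"},
    2: {"name:jo", "city:boston"},
    3: {"name:jo", "city:austin"},
    4: {"name:ma", "city:austin"},
    5: {"name:ma", "city:dallas"},
}


def map_phase(keys_by_record):
    """Emit (block_key, (record_id, all_keys)) once per blocking key."""
    for rid, keys in keys_by_record.items():
        for key in keys:
            yield key, (rid, frozenset(keys))


def shuffle(emitted):
    """Group map output by block key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    return groups


def reduce_phase(block_key, values, deduplicate=True):
    """Emit candidate pairs for one block."""
    ordered = sorted(values, key=lambda v: v[0])
    for (a, keys_a), (b, keys_b) in combinations(ordered, 2):
        # Assumed rule: only the reducer that owns the smallest common
        # key emits the pair, so every pair is produced exactly once.
        if deduplicate and min(keys_a & keys_b) != block_key:
            continue
        yield (a, b)


def run(keys_by_record, deduplicate=True):
    """Simulate one map-shuffle-reduce pass and collect the emitted pairs."""
    groups = shuffle(map_phase(keys_by_record))
    emitted = []
    for block_key, values in groups.items():
        emitted.extend(reduce_phase(block_key, values, deduplicate))
    return emitted


def comparisons_per_block(keys_by_record):
    """Pairs each block would compare without deduplication (load skew)."""
    groups = shuffle(map_phase(keys_by_record))
    return {k: len(v) * (len(v) - 1) // 2 for k, v in groups.items()}


if __name__ == "__main__":
    print("without dedup:", len(run(KEYS, deduplicate=False)))
    print("with dedup:   ", len(run(KEYS, deduplicate=True)))
    print("load per block:", comparisons_per_block(KEYS))
```

Here records 1 and 2 share two blocks, so without the rule the pair (1, 2) is emitted twice. The block `name:jo` also holds more records than the others, so the reducer handling it does the most comparisons, which is the skew that a load-balancing scheme would need to spread across tasks.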

Acknowledgements

Our research is supported by the National Basic Research Program of China (2012CB316203), the National Natural Science Foundation of China (Grant Nos. 61370101 and U1501252), Shanghai Knowledge Service Platform Project (ZF1213), and Innovation Program of Shanghai Municipal Education Commission (14ZZ045).

Author information

Authors and Affiliations

Institute for Data Science and Engineering, School of Computer Science and Software Engineering, East China Normal University, Shanghai, 200062, China
Cheqing Jin, Jie Chen & Huiping Liu

Corresponding author

Correspondence to Cheqing Jin.

Additional information

Cheqing Jin is a professor at East China Normal University (ECNU), China. He received his bachelor's and master's degrees from Zhejiang University, China in 1999 and 2002, respectively, and his PhD degree from Fudan University, China in 2005, all in Computer Science. He worked as an assistant professor at East China University of Science and Technology, China from 2005 to 2008, and joined ECNU in October 2008. In 2003 and 2007, he visited the University of Hong Kong, China and the Chinese University of Hong Kong, China, respectively. He has served as a PC member for more than ten conferences. His main research interests include streaming data management, location-based services, uncertain data management, data quality, and database benchmarking.

Jie Chen received his undergraduate and master's degrees from East China Normal University, China in 2011 and 2014, respectively. He currently works as a risk analyst on the Risk Management team at PayPal. His research areas are data quality and data mining, especially for big data.

Huiping Liu received the BS degree in software engineering from East China Normal University, China in 2013. Currently, he is a PhD student supervised by Prof. Cheqing Jin. His research mainly focuses on data quality, massive data mining and processing, and location-based services.

Electronic supplementary material

Supplementary material, approximately 201 KB.

About this article

Cite this article

Jin, C., Chen, J. & Liu, H. MapReduce-based entity matching with multiple blocking functions. Front. Comput. Sci. 11, 895–911 (2017). https://doi.org/10.1007/s11704-016-5346-4

Received: 14 August 2015
Accepted: 15 December 2015
Published: 09 August 2016
Issue Date: October 2017
DOI: https://doi.org/10.1007/s11704-016-5346-4
