Abstract
Record matching is the task of identifying pairs of records that refer to the same entity across different data sources. In many data mining applications, record matching has quadratic complexity, because every record in one source may have to be compared with every record in another. In practice, the number of non-matching record pairs far exceeds the number of matching pairs; this is known as the imbalance problem. Blocking is a data reduction technique that filters out unlikely matching pairs before record matching. However, no fast and effective blocking algorithm yet exists for big data. In this paper, we report on the use of big data infrastructure to improve the efficiency of blocking. Our approach runs the blocking process independently, in a distributed manner, on partitions of the whole dataset. To improve efficiency further, we adopt a probabilistic technique that balances the speed and the effectiveness of our proposed distributed blocking algorithm. Our experimental analysis supports the superiority of our technique and demonstrates its scalability.
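To make the data reduction concrete, the following minimal sketch shows generic key-based blocking between two sources. It is an illustration only and is not the unsupervised or probabilistic algorithm proposed in this paper. The `blocking_key` function, the `surname` field, and the toy records are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record):
    # Hypothetical key (an assumption for this sketch):
    # the first three letters of the surname, lower-cased.
    return record["surname"][:3].lower()


def candidate_pairs(source_a, source_b, key=blocking_key):
    """Return index pairs (i, j) whose records share a blocking key."""
    # Each block holds the indices from source A and from source B
    # that fall under the same key.
    blocks = defaultdict(lambda: ([], []))
    for i, rec in enumerate(source_a):
        blocks[key(rec)][0].append(i)
    for j, rec in enumerate(source_b):
        blocks[key(rec)][1].append(j)

    # Only records inside the same block are paired for later matching.
    pairs = []
    for left, right in blocks.values():
        pairs.extend(product(left, right))
    return pairs


# Toy data, invented for illustration.
source_a = [{"surname": "Smith"}, {"surname": "Jones"}, {"surname": "Brown"}]
source_b = [{"surname": "Smithe"}, {"surname": "Browne"}, {"surname": "Taylor"}]

pairs = candidate_pairs(source_a, source_b)
all_pairs = len(source_a) * len(source_b)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Without blocking, all 3 × 3 record pairs would go to the matcher. With this key, only the pairs that share a block are kept, which is the filtering effect the abstract refers to. A poorly chosen key can also discard true matches, so the choice of key trades speed against effectiveness.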















Cite this article
Dou, C., Cui, Y., Sun, D. et al. Unsupervised blocking and probabilistic parallelisation for record matching of distributed big data. J Supercomput 75, 623–645 (2019). https://doi.org/10.1007/s11227-017-2008-8
DOI: https://doi.org/10.1007/s11227-017-2008-8