Unsupervised blocking and probabilistic parallelisation for record matching of distributed big data

Dou, Chenxiao; Cui, Yi; Sun, Daniel; Wong, Raymond; Atif, Muhammad; Li, Guoqiang; Ranjan, Rajiv

doi:10.1007/s11227-017-2008-8

Published: 16 March 2017

Volume 75, pages 623–645, (2019)
The Journal of Supercomputing

Chenxiao Dou¹, Yi Cui², Daniel Sun¹,³, Raymond Wong¹, Muhammad Atif⁴, Guoqiang Li² & Rajiv Ranjan⁵

Abstract

Record matching refers to identifying pairs of records that relate to the same entities across different data sources. In many data mining applications, record matching is associated with quadratic complexity. In practice, the number of non-matching record pairs always far exceeds the number of matching pairs; this is called the imbalance problem. Blocking is a data reduction technique that filters out unlikely matching pairs before record matching. However, there is not yet a fast and effective blocking algorithm for big data. In this paper, we report on using big data infrastructure to improve the efficiency of blocking. Our approach runs the blocking process independently, in a distributed fashion, on partitions of the whole dataset. To improve efficiency, we adopt a probabilistic technique to balance the speed and the effectiveness of the algorithm that we propose for distributed blocking. Our experimental analysis supports the superiority of our technique and demonstrates its scalability.
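To make the role of blocking concrete, the following minimal sketch contrasts the quadratic number of record pairs with the candidate pairs left after a simple key-based blocking step. It is an illustration only and is not the algorithm proposed in the paper: the blocking key (first three letters of a name), the record layout, and the function names `blocking_key`, `candidate_pairs` and `all_pairs_count` are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lower-cased name.
    return record["name"].lower()[:3]


def candidate_pairs(records):
    # Group record ids by blocking key, then pair records only within a block.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def all_pairs_count(n):
    # Number of unordered pairs without blocking: n * (n - 1) / 2.
    return n * (n - 1) // 2


# Toy data invented for this example.
records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Maria Garcia"},
    {"id": 4, "name": "Mariah Garcia"},
    {"id": 5, "name": "Wei Zhang"},
]

pairs = candidate_pairs(records)
print(f"{len(pairs)} candidate pairs out of {all_pairs_count(len(records))}")
```

With five records there are ten possible pairs, but only the two pairs that share a block are passed on to the matcher. A key this crude would miss matches whose names differ in the first three letters, which is the kind of trade-off between speed and effectiveness the abstract refers to.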


Author information

Authors and Affiliations

University of New South Wales, Sydney, NSW, Australia
Chenxiao Dou, Daniel Sun & Raymond Wong
School of Software, Shanghai Jiao Tong University, Shanghai, China
Yi Cui & Guoqiang Li
Data61, CSIRO, Canberra, ACT, Australia
Daniel Sun
National Computational Infrastructure, Canberra, ACT, Australia
Muhammad Atif
Newcastle University, Newcastle upon Tyne, UK
Rajiv Ranjan

Corresponding author

Correspondence to Daniel Sun.

About this article

Cite this article

Dou, C., Cui, Y., Sun, D. et al. Unsupervised blocking and probabilistic parallelisation for record matching of distributed big data. J Supercomput 75, 623–645 (2019). https://doi.org/10.1007/s11227-017-2008-8

Published: 16 March 2017
Issue Date: 06 February 2019
DOI: https://doi.org/10.1007/s11227-017-2008-8
