An Efficient Blocking Technique for Reference Matching using MapReduce

Paradies, Marcus

doi:10.1007/s13222-011-0051-9

Technical contribution
Published: 19 February 2011

Volume 11, pages 47–49, (2011)

Datenbank-Spektrum

Marcus Paradies¹

Abstract

Document clustering has become an increasingly important task in data mining and information retrieval. With growing data volumes, CPU- and memory-efficient clustering techniques are receiving considerable attention in the research community. To deal with huge amounts of data (e.g., documents from Wikipedia or CiteSeerX, which are several GB in size), distributed clustering techniques have been designed to provide scalable and flexible approaches. We study the problem of document clustering in the area of Entity Matching, where documents from various data sources are matched with one another. More specifically, we focus on a common optimization technique called blocking, which reduces the enormous search space by clustering the data sources into smaller groups and performing comparisons only within each group. In this article, we describe our experiences and findings in applying the MapReduce framework to huge bibliographic data sets in order to provide a flexible, scalable and easy-to-use blocking technique that reduces the search space for Entity Matching.
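
To make the idea of blocking concrete, the following is a minimal single-process sketch of how it maps onto the MapReduce model: the map step assigns each record a blocking key, the shuffle step groups records by key, and the reduce step forms candidate pairs only within each group. The abstract does not specify which blocking key or similarity function the article uses, so everything here is an assumption made for illustration: the key (the first three alphanumeric characters of the lowercased title), the record layout, and the function names `blocking_key`, `map_phase`, `shuffle`, `reduce_phase` and `candidate_pairs`.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key (an assumption, not the article's method): first three
    # alphanumeric characters of the lowercased title.
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return title[:3]


def map_phase(records):
    # Map: emit one (blocking key, record) pair per input record.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Shuffle: group records by key, as a MapReduce framework would do
    # between the map and reduce phases.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(blocks):
    # Reduce: generate candidate pairs only among records of the same block.
    for key, block in blocks.items():
        for left, right in combinations(block, 2):
            yield key, left["id"], right["id"]


def candidate_pairs(records):
    return sorted(reduce_phase(shuffle(map_phase(records))))


# Illustrative records; ids and titles are made up for this sketch.
records = [
    {"id": 1, "title": "MapReduce: simplified data processing"},
    {"id": 2, "title": "Mapreduce simplified data processing on large clusters"},
    {"id": 3, "title": "Parallel k-means clustering"},
    {"id": 4, "title": "Parallel set-similarity joins"},
    {"id": 5, "title": "Data partitioning for entity matching"},
]
pairs = candidate_pairs(records)
all_pairs = len(list(combinations(records, 2)))
print(pairs, "out of", all_pairs, "possible pairs")
```

In this toy input, the five records would yield ten pairs under exhaustive comparison, whereas blocking leaves two candidate pairs for the (more expensive) matching step. A prefix key this crude can also separate true matches whose titles start differently, which is the usual trade-off a blocking technique has to manage.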

Author information

Authors and Affiliations

FG Datenbanken & Informationssysteme, TU Ilmenau, Ilmenau, Germany
Marcus Paradies

Corresponding author

Correspondence to Marcus Paradies.

About this article

Cite this article

Paradies, M. An Efficient Blocking Technique for Reference Matching using MapReduce. Datenbank Spektrum 11, 47–49 (2011). https://doi.org/10.1007/s13222-011-0051-9

Received: 14 January 2011
Accepted: 08 February 2011
Published: 19 February 2011
Issue Date: April 2011
DOI: https://doi.org/10.1007/s13222-011-0051-9
