Abstract
The problem of identifying duplicate entities in a dataset has gained increasing importance over the last decades. Because of its intrinsic quadratic complexity, solving this problem over large datasets can be very costly. Both researchers and practitioners have developed a variety of techniques to speed up its solution. One such technique is blocking, an indexing technique that splits the dataset into a set of blocks such that each block contains entities sharing a common property evaluated by a blocking key function. To improve the efficacy of blocking, multiple blocking keys may be used, producing a set of blocking results. In this paper, we investigate how to control the size of the blocks generated by the use of multiple blocking keys while maintaining reasonable effectiveness, measured by the quality of the produced blocks. By controlling block sizes, we can reduce the overall cost of solving an entity resolution problem and facilitate the execution of a variety of tasks (e.g., real-time and privacy-preserving entity resolution). To this end, we propose several heuristics that exploit the co-occurrence of entities among the generated blocks to prune, split, and merge blocks. The experimental results over four datasets confirm the adequacy of the proposed heuristics for generating block sizes within a predefined range threshold while maintaining reasonable blocking quality.
Notes
Portions of the entity schema can also be exploited [26].
For single blocking key indexing, we employ the following notation: \(\{bk_1\{e_1,e_2,\ldots \}, bk_2\{e_3,e_4,\ldots \},bk_3\{e_5,e_6,\ldots \},\ldots \}\), such that each \(bk_i\) represents a unique blocking key value generated by the employed blocking key function.
For indexing using multiple blocking key functions, we employ the following notation: \(\{\{bk_{i,1}\{e_1,e_2,\ldots \}, bk_{i,2}\{e_3,e_4,\ldots \}, bk_{i,3}\{e_5,e_6,\ldots \},\ldots \}, \{bk_{j,1}\{e_1,e_2,\ldots \},bk_{j,2}\{e_3,e_4,\ldots \}, bk_{j,3}\{e_5,e_6, \ldots \},\ldots \},\ldots \}\), s.t. each \(bk_{l,m}\) is the m-th blocking key value generated by the l-th blocking key function.
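To make the notation above concrete, a blocking collection can be represented in Python as a list of dictionaries, one per blocking key function; the key values and entity identifiers below are purely illustrative:

```python
# Blocking result of a single blocking key function: a mapping from each
# blocking key value bk_i to the list of entities sharing that value.
single_key_blocks = {
    "bk1": ["e1", "e2"],
    "bk2": ["e3", "e4"],
    "bk3": ["e5", "e6"],
}

# Blocking collection for multiple blocking key functions: one mapping per
# function, so the m-th key value of the l-th function (bk_{l,m}) indexes
# the block blocking_collection[l]["bk_{l,m}"].
blocking_collection = [
    {"bk_i1": ["e1", "e2"], "bk_i2": ["e3", "e4"], "bk_i3": ["e5", "e6"]},
    {"bk_j1": ["e1", "e3"], "bk_j2": ["e2", "e4"], "bk_j3": ["e5", "e6"]},
]

# The same entity may appear in one block per blocking key function,
# which is what enables the co-occurrence scores exploited by the heuristics.
occurrences_of_e1 = sum(
    "e1" in block
    for blocks in blocking_collection
    for block in blocks.values()
)
assert occurrences_of_e1 == 2
```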
These schemes have produced encouraging experimental results in [26].
The merge operations may be performed using blocks from different blocking results (although this possibility is not shown in Fig. 1).
We assume that the changes in the blocks performed by the algorithm are reflected in the input blocking collection \(\mathcal {B}\). We use the same assumption for Algorithms 3–7.
This special case is not shown in the MaxIntersectionMerge algorithm in order to simplify its representation.
For simplicity, we assume that only the blocks \(R\{d_7, d_8, d_{12}\}\), \(Spain\{d_2, d_3, d_8, d_9, d_{10}\}\), \(C\{d_2, d_3, d_4\}\) and \(L\{d_{10}, d_{14}\}\) are pruned by the lBCE algorithm, although all blocks in \(\mathcal {B}_{D_1}\) are considered when calculating the co-occurrence score between entities.
\(\lambda \) values greater than 0.4 have produced low-quality results in the conducted experiments and thus are not reported in this paper.
The “Default” combination (in Table 8) is the only algorithm that is not influenced by the \(\lambda \) parameter.
References
Batini C, Scannapieco M (2016) Data quality dimensions. Springer, Cham, pp 21–51
Batini C, Cappiello C, Francalanci C, Maurino A (2009) Methodologies for data quality assessment and improvement. ACM Comput Surv (CSUR) 41(3):16
Bilenko M, Kamath B, Mooney RJ (2006) Adaptive blocking: learning to scale up record linkage. In: Sixth international conference on data mining, ICDM’06. IEEE, pp 87–96
Christen P (2012) Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, New York
Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555
Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 475–480
Costa G, Manco G, Ortale R (2010) An incremental clustering scheme for data de-duplication. Data Min Knowl Discov 20(1):152–187
Covell M, Baluja S (2009) LSH banding for large-scale retrieval with memory and recall constraints. In: IEEE international conference on acoustics, speech and signal processing, ICASSP 2009. IEEE, pp 1865–1868
De Vries T, Ke H, Chawla S, Christen P (2009) Robust record linkage blocking using suffix arrays. In: Proceedings of the 18th ACM conference on Information and knowledge management. ACM, pp 305–314
do Nascimento DC, Pires CES, Mestre DG (2018) Heuristic-based approaches for speeding up incremental record linkage. J Syst Softw 137:335–354
Ebraheem M, Thirumuruganathan S, Joty S, Ouzzani M, Tang N (2018) Distributed representations of tuples for entity resolution. Proc VLDB Endow 11(11):1454–1467
Fisher J, Christen P, Wang Q, Rahm E (2015) A clustering-based framework to control block sizes for entity resolution. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 279–288
Ganganath N, Cheng CT, Chi KT (2014) Data clustering with cluster size constraints using a modified \(k\)-means algorithm. In: 2014 International conference on cyber-enabled distributed computing and knowledge discovery (CyberC). IEEE, pp 158–161
Giraud-Carrier C, Goodliffe J, Jones BM, Cueva S (2015) Effective record linkage for mining campaign contribution data. Knowl Inf Syst 45(2):389–416
Gomes Mestre D, Pires CES (2013) Improving load balancing for mapreduce-based entity matching. In: 2013 IEEE symposium on computers and communications (ISCC). IEEE, pp 618–624
Gruenheid A, Dong XL, Srivastava D (2014) Incremental record linkage. Proc VLDB Endow 7(9):697–708
Hassanzadeh O, Chiang F, Lee HC, Miller RJ (2009) Framework for evaluating clustering algorithms in duplicate detection. Proc VLDB Endow 2(1):1282–1293
Kolb L, Thor A, Rahm E (2012) Multi-pass sorted neighborhood blocking with mapreduce. Comput Sci Res Dev 27(1):45–63
Koudas N, Sarawagi S, Srivastava D (2006) Record linkage: similarity measures and algorithms. In: Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, pp 802–803
Malinen MI, Fränti P (2014) Balanced \(k\)-means for clustering. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR). Springer, pp 32–41
Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647
Mestre DG, Pires CE, Nascimento DC (2015) Adaptive sorted neighborhood blocking for entity matching with mapreduce. In: Proceedings of the 30th annual ACM symposium on applied computing. ACM, pp 981–987
Mestre DG, Pires CES, Nascimento DC (2017) Towards the efficient parallelization of multi-pass adaptive blocking for entity matching. J Parallel Distrib Comput 101:27–40
Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI, pp 440–445
Nascimento DC, Pires CE, Mestre D (2015) Data quality monitoring of cloud databases based on data quality SLAs. In: Trovati M, Hill R, Anjum A, Zhu S, Liu L (eds) Big-data analytics and cloud computing. Springer, Cham, pp 3–20
Papadakis G, Koutrika G, Palpanas T, Nejdl W (2014) Meta-blocking: taking entity resolution to the next level. IEEE Trans Knowl Data Eng 26(8):1946–1960
Papadakis G, Papastefanatos G, Koutrika G (2014) Supervised meta-blocking. Proc VLDB Endow 7(14):1929–1940
Papenbrock T, Heise A, Naumann F (2015) Progressive duplicate detection. IEEE Trans Knowl Data Eng 27(5):1316–1329
Ramadan B, Christen P, Liang H, Gayler RW (2015) Dynamic sorted neighborhood indexing for real-time entity resolution. J Data Inf Qual 6(4):15
Ranbaduge T, Vatsalan D, Christen P (2015) Clustering-based scalable indexing for multi-party privacy-preserving record linkage. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 549–561
Ranbaduge T, Vatsalan D, Christen P, Verykios V (2016) Hashing-based distributed multi-party blocking for privacy-preserving record linkage. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 415–427
Rebollo-Monedero D, Solé M, Nin J, Forné J (2013) A modification of the \(k\)-means method for quasi-unsupervised learning. Knowl Based Syst 37:176–185
Vatsalan D, Christen P, Verykios VS (2013) A taxonomy of privacy-preserving record linkage techniques. Inf Syst 38(6):946–969
Vatsalan D, Christen P (2013) Sorted nearest neighborhood clustering for efficient private blocking. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 341–352
Verykios VS, Karakasidis A, Mitrogiannis VK (2009) Privacy preserving record linkage approaches. Int J Data Min Model Manag 1(2):206–221
Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD international conference on management of data. ACM, pp 219–232
Whang SE, Marmaros D, Garcia-Molina H (2013) Pay-as-you-go entity resolution. IEEE Trans Knowl Data Eng 25(5):1111–1124
Yan S, Lee D, Kan MY, Giles LC (2007) Adaptive sorted neighborhood methods for efficient record linkage. In: Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries. ACM, pp 185–194
Zhu S, Wang D, Li T (2010) Data clustering with size constraints. Knowl Based Syst 23(8):883–889
Appendices
Employed blocking keys
Let S be a string, \(F_n(S)\) be the first n letters of S, nB(S) be the first n bigrams [4] of S and nT(S) be the first n trigrams [4] of S. Using this notation, Table 13 presents the blocking key functions used to index the datasets in the conducted experiments (Sect. 5). Since the entities in the CoraATDV dataset do not follow a predefined schema, we split each entity by white space. Each part of the split result is denoted as SplitLine[k], where k is the index of the resulting array.
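The helpers above can be sketched in Python as follows. This is a minimal sketch under the notation just defined; the function names and the concatenation of the selected n-grams into a single blocking key value are our assumptions, not the paper's implementation:

```python
def first_n_letters(s: str, n: int) -> str:
    """F_n(S): the first n letters of S."""
    return s[:n]

def ngrams(s: str, q: int) -> list:
    """All contiguous character q-grams of S."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def first_n_bigrams(s: str, n: int) -> str:
    """nB(S): the first n bigrams of S, concatenated as one key value."""
    return "".join(ngrams(s, 2)[:n])

def first_n_trigrams(s: str, n: int) -> str:
    """nT(S): the first n trigrams of S, concatenated as one key value."""
    return "".join(ngrams(s, 3)[:n])

# Example blocking key values for the string "Spain":
assert first_n_letters("Spain", 3) == "Spa"       # S, p, a
assert first_n_bigrams("Spain", 2) == "Sppa"      # bigrams: Sp, pa
assert first_n_trigrams("Spain", 2) == "Spapai"   # trigrams: Spa, pai
```

For schema-less entities such as those in CoraATDV, the analogue of SplitLine would simply be `entity_text.split()`, with SplitLine[k] the k-th element of the resulting list.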
Fixed window size for SN
Let \(S_{min}\) and \(S_{max}\) be two size parameters, D be a dataset, \(\mathcal {F}\) be a set of blocking key functions and \(\mathcal {B}\) be the blocking collection generated by indexing D using \(\mathcal {F}\). Considering the problem presented in Definition 6, the pruned blocking collection \(\mathcal {B}'_{max}\) that yields the maximum aggregated cardinality consists of (\(|D|\cdot |\mathcal {F}|\) div \(S_{max}\)) blocks containing \(S_{max}\) entities each and one block containing the remaining (\(|D|\cdot |\mathcal {F}|\) mod \(S_{max}\)) entities. Following Definition 1, we can calculate the aggregated cardinality \(||\mathcal {B}'_{max}||\) as shown in Eq. (5).
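Under the block composition just described, the bound can be computed directly. The sketch below assumes, consistent with Definition 1 as described here, that the cardinality of a block with b entities is the number of pairwise comparisons b(b-1)/2:

```python
def block_cardinality(b: int) -> int:
    """Number of pairwise comparisons within a block of b entities: C(b, 2)."""
    return b * (b - 1) // 2

def max_aggregated_cardinality(dataset_size: int, num_functions: int,
                               s_max: int) -> int:
    """||B'_max||: (|D|*|F| div S_max) blocks of S_max entities plus one
    block holding the remaining (|D|*|F| mod S_max) entities (cf. Eq. 5)."""
    total = dataset_size * num_functions
    full_blocks, remainder = divmod(total, s_max)
    return full_blocks * block_cardinality(s_max) + block_cardinality(remainder)

# Example: |D| = 100 entities, |F| = 3 blocking key functions, S_max = 7.
# 300 = 42 * 7 + 6, so 42 blocks of 7 entities and one block of 6 entities:
# 42 * C(7,2) + C(6,2) = 42 * 21 + 15 = 897 comparisons at most.
assert max_aggregated_cardinality(100, 3, 7) == 897
```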
Therefore, by employing a sorted neighborhood (SN) approach with a fixed window size equal to w over the blocks in \(\mathcal {B}\), the number of generated comparisons cannot exceed \(||\mathcal {B}'_{max}||\). Since the SN algorithm is executed \(|\mathcal {F}|\) times (i.e., for each employed blocking key function), the number of comparisons generated by employing the SN method for each blocking key function cannot exceed \(\frac{||\mathcal {B}'_{max}||}{|\mathcal {F}|}\). Thus, we need to configure a fixed window size (w) that generates at most \(\frac{||\mathcal {B}'_{max}||}{|\mathcal {F}|}\) comparisons.
Since the number of comparisons generated by the SN algorithm can be theoretically estimated [5], we can calculate the maximum window size allowed (\(w_{max}\)) based on Eq. (6).
From Eq. (6), we conclude that \(w_{max}\) can be calculated as shown in Eq. (7), i.e., the maximum value of w that generates at most \(\frac{||\mathcal {B}'_{max}||}{|\mathcal {F}|}\) comparisons between entities.
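Since Eqs. (6) and (7) are not reproduced in this excerpt, the computation of \(w_{max}\) can be sketched numerically. We assume the standard comparison estimate for a sorted neighborhood pass [5]: the first window of size w over n sorted records yields w(w-1)/2 pairs and each of the remaining n - w window shifts adds w - 1 new pairs, for a total of (w-1)(n - w/2). The closed-form \(w_{max}\) of Eq. (7) is replaced here by a simple search over this estimate:

```python
def sn_comparisons(n: int, w: int) -> float:
    """Estimated comparisons of one sorted neighborhood (SN) pass with
    window size w over n records: w(w-1)/2 + (n - w)(w - 1) = (w-1)(n - w/2)."""
    return (w - 1) * (n - w / 2)

def max_window_size(n: int, budget: float) -> int:
    """Largest window w (2 <= w <= n) whose estimated comparison count
    stays within the per-function budget ||B'_max|| / |F|."""
    w = 2
    while w + 1 <= n and sn_comparisons(n, w + 1) <= budget:
        w += 1
    return w

# Example: n = 1000 records with a budget of 10,000 comparisons per
# blocking key function: w = 11 gives (11-1)(1000 - 5.5) = 9945 <= 10000,
# while w = 12 would give 10,934 comparisons.
assert max_window_size(1000, 10_000) == 11
```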
Note that, since our approach can perform shrink and exclude operations over the input blocking collection \(\mathcal {B}\), we cannot calculate the minimum number of comparisons generated by the pruned blocking collection \(\mathcal {B}'\). For this reason, for each dataset, we employ the SN method using two variations of the \(w_{max}\) value given by Eq. (7).
Block size distribution
Figures 10 and 11 present the distribution of block sizes in the input blocking collections of two datasets (Cora and DBLPM4) employed in the experiments, as well as in the pruned blocking collections generated by one of the proposed algorithms (lECP; see Table 8).
Nascimento, D.C., Pires, C.E.S. & Mestre, D.G. Exploiting block co-occurrence to control block sizes for entity resolution. Knowl Inf Syst 62, 359–400 (2020). https://doi.org/10.1007/s10115-019-01347-0