Parallelizing Record Linkage for Disclosure Risk Assessment

Guisado-Gámez, Joan; Prat-Pérez, Arnau; Nin, Jordi; Muntés-Mulero, Victor; Larriba-Pey, Josep Ll.

doi:10.1007/978-3-540-87471-3_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5262)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

Handling very large volumes of confidential data is becoming common practice in many organizations, such as statistical agencies. This calls for protection methods, which must be validated in terms of the quality they provide. Record Linkage (RL) makes it possible to compute the disclosure risk, which gives a measure of the quality of a data protection method. However, the RL methods proposed in the literature are computationally costly, which poses difficulties when RL processes have to be executed frequently on large data sets.

Here, we propose a distributed computing technique to improve the performance of an RL process. We show that our technique not only significantly reduces the computing time of an RL process but also scales in a distributed environment. We also show that distributed computation can be complemented with SMP-based parallelization within each node, increasing the final speedup.
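
As a rough illustration of the idea summarized in the abstract, the following minimal sketch estimates disclosure risk with a simple distance-based record linkage and splits the masked records into chunks that are linked concurrently. It is not the authors' implementation: the function names (`link_chunk`, `disclosure_risk`), the Euclidean distance, and the assumption that `masked[i]` was derived from `original[i]` are all hypothetical choices made for the example, and a thread pool merely stands in for the nodes of a distributed system.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist  # Euclidean distance, available in Python 3.8+


def link_chunk(start, masked_chunk, original):
    """Count masked records in one chunk whose nearest original record
    is their true source (assumed: masked[i] was derived from original[i])."""
    hits = 0
    for offset, masked_record in enumerate(masked_chunk):
        # Brute-force nearest neighbour over the whole original file.
        nearest = min(range(len(original)),
                      key=lambda i: dist(masked_record, original[i]))
        if nearest == start + offset:
            hits += 1
    return hits


def disclosure_risk(original, masked, workers=4):
    """Fraction of masked records that are correctly re-identified."""
    n = len(masked)
    if n == 0:
        return 0.0
    size = -(-n // workers)  # ceiling division: records per chunk
    tasks = [(s, masked[s:s + size]) for s in range(0, n, size)]
    # Hypothetical stand-in for a cluster: each chunk is linked by one worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda t: link_chunk(t[0], t[1], original), tasks)
        return sum(hits) / n


# Toy data: the last two masked records end up closer to each other's source.
original = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (2.0, 8.0)]
masked = [(1.2, 0.9), (5.1, 4.8), (2.1, 7.9), (8.8, 1.1)]
print(disclosure_risk(original, masked, workers=2))  # 0.5
```

Each chunk needs only its own masked records plus read access to the original file, so the chunks are independent and their hit counts can simply be summed. In CPython, threads would not speed up this CPU-bound loop; a process pool or a message-passing layer across machines would be needed for real gains, which is closer in spirit to the distributed and SMP setting the abstract describes.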

Author information

Authors and Affiliations

DAMA-UPC, Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Campus Nord, C/Jordi Girona 1-3, 08034, Barcelona, Catalonia, Spain
Joan Guisado-Gámez, Arnau Prat-Pérez, Victor Muntés-Mulero & Josep Ll. Larriba-Pey
IIIA, Artificial Intelligence Research Institute CSIC, Spanish National Research Council, Campus UAB s/n, 08193, Bellaterra, Catalonia, Spain
Jordi Nin

Editor information

Josep Domingo-Ferrer, Yücel Saygın

About this paper

Cite this paper

Guisado-Gámez, J., Prat-Pérez, A., Nin, J., Muntés-Mulero, V., Larriba-Pey, J.L. (2008). Parallelizing Record Linkage for Disclosure Risk Assessment. In: Domingo-Ferrer, J., Saygın, Y. (eds) Privacy in Statistical Databases. PSD 2008. Lecture Notes in Computer Science, vol 5262. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87471-3_16

DOI: https://doi.org/10.1007/978-3-540-87471-3_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87470-6
Online ISBN: 978-3-540-87471-3
eBook Packages: Computer Science, Computer Science (R0)