Abstract
Handling very large volumes of confidential data is becoming common practice in many organizations, such as statistical agencies. This calls for protection methods that must be validated in terms of the quality they provide. Record Linkage (RL) makes it possible to compute the disclosure risk, which gives a measure of the quality of a data protection method. However, the RL methods proposed in the literature are computationally costly, which poses difficulties when RL processes have to be executed frequently on large data.
Here, we propose a distributed computing technique to improve the performance of an RL process. We show that our technique not only significantly reduces the computing time of an RL process, but also scales in a distributed environment. We also show that distributed computation can be complemented with SMP-based parallelization in each node, increasing the final speedup.
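The abstract does not spell out the technique itself, so the following is only a minimal sketch of the general idea, assuming distance-based record linkage: each masked record is linked to its nearest original record, and the disclosure risk is taken as the fraction of correct links. The masked file is split into chunks that are linked independently, and the partial counts are summed. All names (`squared_distance`, `count_correct_links`, `disclosure_risk`) are hypothetical, and a thread pool stands in for the nodes of a distributed environment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]


def squared_distance(a: Record, b: Record) -> float:
    """Squared Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def count_correct_links(original: Sequence[Record],
                        masked: Sequence[Record],
                        start: int, stop: int) -> int:
    """Link masked[start:stop] to the nearest original record.

    Assumption: masked[i] is the protected version of original[i], so a
    link is correct when the nearest original record has index i.
    """
    correct = 0
    for i in range(start, stop):
        nearest = min(range(len(original)),
                      key=lambda j: squared_distance(masked[i], original[j]))
        if nearest == i:
            correct += 1
    return correct


def disclosure_risk(original: Sequence[Record],
                    masked: Sequence[Record],
                    workers: int = 4) -> float:
    """Fraction of masked records re-identified by nearest-record linkage."""
    n = len(masked)
    if n == 0:
        return 0.0
    # Split the masked file into contiguous chunks, one per worker.
    chunk = (n + workers - 1) // workers
    bounds: List[Tuple[int, int]] = [(s, min(s + chunk, n))
                                     for s in range(0, n, chunk)]
    # Threads are an assumption standing in for cluster nodes; each chunk
    # only needs read access to the original file and its own masked slice.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(
            lambda b: count_correct_links(original, masked, b[0], b[1]),
            bounds)
        total = sum(partial)
    return total / n
```

Because every chunk is linked without reference to the others, the only communication in this sketch is the final sum of partial counts, which is what makes this kind of workload a natural fit for partitioning across nodes.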
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guisado-Gámez, J., Prat-Pérez, A., Nin, J., Muntés-Mulero, V., Larriba-Pey, J.L. (2008). Parallelizing Record Linkage for Disclosure Risk Assessment. In: Domingo-Ferrer, J., Saygın, Y. (eds) Privacy in Statistical Databases. PSD 2008. Lecture Notes in Computer Science, vol 5262. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87471-3_16
DOI: https://doi.org/10.1007/978-3-540-87471-3_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87470-6
Online ISBN: 978-3-540-87471-3
eBook Packages: Computer Science, Computer Science (R0)