
Parallelizing Record Linkage for Disclosure Risk Assessment

  • Conference paper
Privacy in Statistical Databases (PSD 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5262)

Abstract

Handling very large volumes of confidential data is becoming common practice in many organizations, such as statistical agencies. This calls for protection methods, which must be validated in terms of the quality they provide. Record Linkage (RL) makes it possible to compute the disclosure risk, which gives a measure of the quality of a data protection method. However, the RL methods proposed in the literature are computationally costly, which poses difficulties when RL processes have to be executed frequently on large data sets.

Here, we propose a distributed computing technique to improve the performance of an RL process. We show that our technique not only significantly reduces the computing time of an RL process, but also scales in a distributed environment. We also show that distributed computation can be complemented with SMP-based parallelization in each node, increasing the final speedup.
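To make the idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes distance-based record linkage on numeric records, takes the disclosure risk to be the fraction of protected records whose nearest original record is their true source, and splits the protected file into chunks that are linked independently. The function names (`squared_distance`, `link_chunk`, `disclosure_risk`), the toy data, and the use of a thread pool are all illustrative assumptions; in a real deployment each chunk would go to a separate process or cluster node.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def link_chunk(original, protected, start, stop):
    """Count protected records in [start, stop) whose nearest original
    record is the one they were derived from (same index)."""
    hits = 0
    for i in range(start, stop):
        nearest = min(
            range(len(original)),
            key=lambda j: squared_distance(protected[i], original[j]),
        )
        if nearest == i:
            hits += 1
    return hits


def disclosure_risk(original, protected, n_chunks=4):
    """Fraction of protected records correctly re-linked to their source.

    The protected file is cut into n_chunks contiguous ranges; each range
    is linked against the full original file by one worker (assumption:
    a thread pool stands in for the nodes of a distributed system).
    """
    n = len(protected)
    bounds = [(k * n // n_chunks, (k + 1) * n // n_chunks)
              for k in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        futures = [pool.submit(link_chunk, original, protected, lo, hi)
                   for lo, hi in bounds]
        return sum(f.result() for f in futures) / n


# Toy data (hypothetical): record i of `protected` is a masked version
# of record i of `original`.
original = [(1.0, 2.0), (5.0, 5.0), (9.0, 1.0), (3.0, 8.0)]
protected = [(1.2, 2.1), (8.0, 2.0), (8.9, 1.3), (3.1, 7.6)]
risk = disclosure_risk(original, protected, n_chunks=2)
print(risk)  # 0.75: three of the four protected records are re-identified
```

Because every chunk only reads the two files and returns a count, the chunks need no coordination beyond the final sum, which is what makes this kind of workload easy to spread over nodes and over the cores within each node.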



Editor information

Josep Domingo-Ferrer, Yücel Saygın

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guisado-Gámez, J., Prat-Pérez, A., Nin, J., Muntés-Mulero, V., Larriba-Pey, J.L. (2008). Parallelizing Record Linkage for Disclosure Risk Assessment. In: Domingo-Ferrer, J., Saygın, Y. (eds) Privacy in Statistical Databases. PSD 2008. Lecture Notes in Computer Science, vol 5262. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87471-3_16

  • DOI: https://doi.org/10.1007/978-3-540-87471-3_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87470-6

  • Online ISBN: 978-3-540-87471-3

  • eBook Packages: Computer Science, Computer Science (R0)
