Comparison of Public-Domain Software and Services For Probabilistic Record Linkage and Address Standardization

  • Conference paper
Towards Integrative Machine Learning and Knowledge Extraction

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10344)

Abstract

Probabilistic record linkage (PRL) is the process of matching records from different data sources, such as database tables, in which some index values are missing or corrupted. A human is often kept in the loop to review cases that an algorithm cannot match. PRL can be applied to join or de-duplicate records, or to impute missing data, resulting in better overall data quality. An important subproblem in PRL is parsing a field such as an address into its components, e.g., street number, street name, city, state, and zip code. Data analysis techniques such as natural language processing and machine learning are often employed in both PRL and address standardization to achieve higher linking or prediction accuracy. This work compares the performance of four reputable PRL packages freely available in the public domain, namely FRIL, Link Plus, R RecordLinkage, and SERF. In addition, we evaluate the baseline performance and sensitivity of four address-parsing web services: the Data Science Toolkit, Geocoder.us, Google Maps APIs, and the U.S. address parser. Finally, we present some of the strengths and limitations of the software and services we have evaluated.
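
As a rough illustration of the two tasks described in the abstract, the sketch below parses a one-line address into components and then scores a pair of parsed records, sending borderline pairs to human review. It is not the method of any package or service evaluated in the paper: the regular expression, the function names (`parse_address`, `field_similarity`, `link_decision`), and the thresholds are all hypothetical, and only the Python standard library is used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern for simple one-line U.S. addresses such as
# "123 Main St, Chicago, IL 60616"; real addresses are far messier.
ADDRESS_RE = re.compile(
    r"^\s*(?P<street_number>\d+)\s+(?P<street_name>[^,]+?)\s*,"
    r"\s*(?P<city>[^,]+?)\s*,"
    r"\s*(?P<state>[A-Za-z]{2})\s+(?P<zip_code>\d{5})\s*$"
)


def parse_address(text):
    """Split a one-line address into components, or return None."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None


def field_similarity(a, b):
    """Case-insensitive string similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_decision(rec_a, rec_b, upper=0.9, lower=0.7):
    """Average the per-field similarities and classify the pair.

    The thresholds are illustrative assumptions; pairs scoring between
    them are routed to human review.
    """
    fields = rec_a.keys() & rec_b.keys()
    score = sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    if score >= upper:
        return "match", score
    if score >= lower:
        return "review", score
    return "non-match", score


a = parse_address("123 Main St, Chicago, IL 60616")
b = parse_address("123 Main Street, Chicago, IL 60616")
print(a)
print(link_decision(a, b))
```

An unweighted average is the simplest possible scoring rule; probabilistic approaches in the tradition of Fellegi and Sunter [7] instead weight each field by how informative agreement on it is. The sketch also assumes both addresses parse successfully and does not handle a `None` result.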

References

  1. Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)

  2. Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data Quality and Record Linkage Techniques. Springer, New York (2007)

  3. Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Su, Q., Thavisomboon, S., Widom, J.: Generic entity resolution in the SERF project. IEEE Data Eng. Bull. 29(2), 13–20 (2006)

  4. Talburt, J.R.: Entity Resolution and Information Quality. Elsevier, New York (2011)

  5. Dunn, H.L.: Record linkage. Am. J. Public Health Nations Health 36, 1412–1416 (1946)

  6. Schwartz, E.E.: Some observations on the Canadian family allowances program. Soc. Serv. Rev. 20(4), 451–473 (1946)

  7. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210 (1969)

  8. Holzinger, A., Jurisica, I.: Interactive Knowledge Discovery and Data Mining in Biomedical Informatics: State-of-the-Art and Future Challenges, vol. 8401. Springer, Heidelberg (2014)

  9. Hudson, K., Lifton, R., Patrick-Lake, B.: The precision medicine initiative cohort program – building a research foundation for 21st century medicine (2015). http://acd.od.nih.gov/reports/DRAFT-PMI-WG-Report-9-11-2015-508.pdf

  10. Biemer, P.: Introduction to Part 2: Survey processing. In: Pfeffermann, D., Rao, C.R. (eds.) Handbook of Statistics 29A: Sample Surveys: Design, Methods and Applications, pp. 157–162. Elsevier (2009)

  11. Wagner, D., Layne, M.: The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications’ (CARRA) record linkage software. CARRA working paper series (2014)

  12. Mulrow, E., Mushtaq, A., Pramanik, S., Fontes, A.: Assessment of the US Census Bureau’s Person Identification Validation System. Technical report, NORC at the University of Chicago (2011)

  13. Jurczyk, P., Lu, J.J., Xiong, L., Cragan, J.D., Correa, A.: FRIL: A tool for comparative record linkage. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association, vol. 2008, pp. 440–444 (2008)

  14. Centers for Disease Control and Prevention (CDC): Link Plus (2008). http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

  15. Sariyar, M., Borg, A.: The RecordLinkage package: Detecting errors in data. R J. 2, 61–67 (2010)

  16. MatchWare Technologies Inc: AutoMatch: Generalized record linkage system user’s manual (1996)

  17. DataLadder: DataMatch. http://dataladder.com/data-matching-software/

  18. Campbell, K.M.: The Link King: Record linkage and consolidation software. http://www.the-link-king.com/index.html

  19. Campbell, K.M., Deck, D., Krupski, A.: Record linkage software in the public domain: a comparison of Link Plus, the Link King, and a “basic” deterministic algorithm. Health Inform. J. 14, 5–15 (2008)

  20. Gregg, F., Deng, C., Batchkarov, M., Cochrane, J.: usaddress (2014). https://github.com/datamade/usaddress

  21. Google Inc.: The Google Maps Geocoding API. https://developers.google.com/maps/documentation/geocoding/intro

  22. geocoder.us. http://206.220.230.164

  23. Warden, P.: The Data Science Toolkit. http://www.datasciencetoolkit.org

  24. Valassis: Residential & Business Database. http://www.valassis.com/direct-mail/mailing-lists/residential-and-business-lists.aspx

  25. Winkler, W.E.: The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau. Citeseer (1999)

  26. Jurczyk, P.: FRIL: Fine-grained record integration and linkage tool tutorial, version 3.2 (2009)

  27. Cohen, W., Ravikumar, P., Fienberg, S.: SecondString: An open source Java toolkit of approximate string-matching techniques (2003). http://secondstring.sourceforge.net

  28. Gisgraphy: Address parser. http://www.gisgraphy.com/

  29. Yahoo: Placefinder. https://developer.yahoo.com/boss/geo/

  30. Gisgraphy: Gisgraphy results comparison. http://www.gisgraphy.com/compare/

  31. Deng, C., Ernsthausen, J.: Parsing addresses with usaddress (2014). Blog article. https://datamade.us/blog/parsing-addresses-with-usaddress

  32. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)

  33. Google Inc.: googlemaps 2.2 Python Client for Google Maps Services [21] (2015). https://pypi.python.org/pypi/googlemaps/

  34. Yu, X.: pygeocoder 1.2.5. Python interface for Google Geocoding API [21] (2014). https://pypi.python.org/pypi/pygeocoder

  35. Choi, S.C.T., Lin, Y.H.: Comparison of Public-Domain Software and Services for Probabilistic Record Linkage and Address Standardization. GitHub repository, https://github.com/schoi32/prl-splncs

  36. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

  37. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255. IEEE (2009)

  38. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint (2016). arXiv:1603.04467

  39. Parkhi, O.M., Vedaldi, A., Zisserman, A., et al.: Deep face recognition. In: BMVC, vol. 1, pp. 41.1–41.12. BMVA Press (2015)

  40. Vedaldi, A., Lenc, K.: MatConvNet: Convolutional neural networks for Matlab. In: Proceedings of the 23rd ACM international conference on Multimedia, pp. 689–692. ACM (2015)

  41. MapQuest: Geocoding API. https://developer.mapquest.com/products/geocoding

  42. Microsoft Corporation: Bing Maps REST services. https://msdn.microsoft.com/en-us/library/ff701713.aspx

  43. Philips, L.: The double metaphone search algorithm. C/C++ Users J. 18, 38–43 (2000)

  44. NORC at the University of Chicago: Task 4, further PBS research report. Technical report, not published (2012)

  45. DataMade: probablepeople. Python library (2014). https://github.com/datamade/probablepeople

Acknowledgements

We thank Kirk Wolter, Ned English, and Ilana Ventura for discussion. We also thank Katie Dekker for her expertise in sampling the USPS address database [24]. We are grateful for the feedback from Andreas Holzinger. The first author would like to thank the following people for interesting and inspiring discussions: Forest Gregg, Lulu Kang, Aleksandr Likhterman, Lek-Heng Lim, Dean Resnick, and the students enrolled in the research course SCI 498 / MATH 491 Computational Social Sciences, Illinois Institute of Technology, Summer 2016; in particular, Fabrício Soares deserves special thanks for figuring out the IP address of the geocoder.us server [22]. Last but not least, we appreciate the assistance of Jack Huang, University of Chicago, in verifying and enhancing our code for address standardization in Summer 2017.

Author information

Corresponding author

Correspondence to Sou-Cheng T. Choi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Choi, SC.T., Lin, Y., Mulrow, E. (2017). Comparison of Public-Domain Software and Services For Probabilistic Record Linkage and Address Standardization. In: Holzinger, A., Goebel, R., Ferri, M., Palade, V. (eds) Towards Integrative Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, vol 10344. Springer, Cham. https://doi.org/10.1007/978-3-319-69775-8_3

  • DOI: https://doi.org/10.1007/978-3-319-69775-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69774-1

  • Online ISBN: 978-3-319-69775-8

  • eBook Packages: Computer Science, Computer Science (R0)
