Abstract
Probabilistic record linkage (PRL) is the process of matching records from different data sources, such as database tables, in which some index values are missing or corrupted. A human is often kept in the loop to review cases that an algorithm cannot match. PRL can be applied to join or de-duplicate records, or to impute missing data, resulting in better overall data quality. An important subproblem in PRL is parsing a field such as an address into its components, e.g., street number, street name, city, state, and zip code. Data analysis techniques such as natural language processing and machine learning are often employed in both PRL and address standardization to achieve higher linking or prediction accuracy. This work compares the performance of four reputable PRL packages freely available in the public domain, namely FRIL, Link Plus, R RecordLinkage, and SERF. In addition, we evaluate the baseline performance and sensitivity of four address-parsing web services: the Data Science Toolkit, Geocoder.us, Google Maps APIs, and the U.S. address parser. Finally, we present some of the strengths and limitations of the software and services we have evaluated.
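To make the linkage idea concrete, the following is a minimal sketch in the spirit of Fellegi–Sunter match weights, with an intermediate score band routed to human review. It is not the method implemented by any of the four packages compared here. The field names, the m- and u-probabilities, the similarity cutoff, and the two decision thresholds are all illustrative assumptions, and the string comparison uses Python's standard `difflib` rather than a dedicated record-linkage comparator.

```python
import math
from difflib import SequenceMatcher

# Hypothetical per-field parameters (illustrative values, not estimated from data):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "street": (0.90, 0.05),
    "zip": (0.98, 0.10),
}


def agrees(a, b, cutoff=0.85):
    """Return True/False for agreement, or None if either value is missing."""
    if not a or not b:
        return None  # a missing value contributes no evidence either way
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def match_weight(rec_a, rec_b):
    """Sum log2 likelihood ratios over all fields present in both records."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        outcome = agrees(rec_a.get(field), rec_b.get(field))
        if outcome is None:
            continue
        if outcome:
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Map a weight to a decision; the thresholds are assumed, not calibrated."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"  # ambiguous pairs go to a human reviewer


if __name__ == "__main__":
    a = {"name": "John Smith", "street": "123 Main St", "zip": "60616"}
    b = {"name": "Jon Smith", "street": "9 Oak Ave", "zip": "60616"}
    print(classify(match_weight(a, b)))
```

In practice the m- and u-probabilities would be estimated from data (for example with an EM procedure) and the thresholds chosen to balance false matches against the volume of clerical review.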
References
Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)
Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data Quality and Record Linkage Techniques. Springer, New York (2007)
Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Su, Q., Thavisomboon, S., Widom, J.: Generic entity resolution in the SERF project. IEEE Data Eng. Bull. 29(2), 13–20 (2006)
Talburt, J.R.: Entity Resolution and Information Quality. Elsevier, New York (2011)
Dunn, H.L.: Record linkage. Am. J. Public Health Nations Health 36, 1412–1416 (1946)
Schwartz, E.E.: Some observations on the Canadian family allowances program. Soc. Serv. Rev. 20(4), 451–473 (1946)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64, 1183–1210 (1969)
Holzinger, A., Jurisica, I.: Interactive Knowledge Discovery and Data Mining in Biomedical Informatics: State-of-the-Art and Future Challenges, vol. 8401. Springer, Heidelberg (2014)
Hudson, K., Lifton, R., Patrick-Lake, B.: The precision medicine initiative cohort program – building a research foundation for 21st century medicine (2015). http://acd.od.nih.gov/reports/DRAFT-PMI-WG-Report-9-11-2015-508.pdf
Biemer, P.: Introduction to Part 2: Survey processing. In: Pfeffermann, D., Rao, C.R. (eds.) Handbook of Statistics 29A: Sample Surveys: Design, Methods and Applications, pp. 157–162. Elsevier (2009)
Wagner, D., Layne, M.: The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications’ (CARRA) record linkage software. CARRA working paper series (2014)
Mulrow, E., Mushtaq, A., Pramanik, S., Fontes, A.: Assessment of the US Census Bureau’s Person Identification Validation System. Technical report, NORC at the University of Chicago (2011)
Jurczyk, P., Lu, J.J., Xiong, L., Cragan, J.D., Correa, A.: FRIL: A tool for comparative record linkage. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association, vol. 2008, pp. 440–444 (2008)
Centers for Disease Control and Prevention (CDC): Link Plus (2008). http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm
Sariyar, M., Borg, A.: The RecordLinkage package: Detecting errors in data. R J. 2, 61–67 (2010)
MatchWare Technologies Inc: AutoMatch: Generalized record linkage system user’s manual (1996)
DataLadder: DataMatch. http://dataladder.com/data-matching-software/
Campbell, K.M.: The Link King: Record linkage and consolidation software. http://www.the-link-king.com/index.html
Campbell, K.M., Deck, D., Krupski, A.: Record linkage software in the public domain: a comparison of Link Plus, the Link King, and a “basic” deterministic algorithm. Health Inform. J. 14, 5–15 (2008)
Gregg, F., Deng, C., Batchkarov, M., Cochrane, J.: usaddress (2014). https://github.com/datamade/usaddress
Google Inc.: The Google Maps Geocoding API. https://developers.google.com/maps/documentation/geocoding/intro
geocoder.us. http://206.220.230.164
Warden, P.: The Data Science Toolkit. http://www.datasciencetoolkit.org
Valassis: Residential & Business Database. http://www.valassis.com/direct-mail/mailing-lists/residential-and-business-lists.aspx
Winkler, W.E.: The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau. Citeseer (1999)
Jurczyk, P.: FRIL: Fine-grained record integration and linkage tool tutorial, version 3.2 (2009)
Cohen, W., Ravikumar, P., Fienberg, S.: SecondString: An open source Java toolkit of approximate string-matching techniques (2003). http://secondstring.sourceforge.net
Gisgraphy: Address parser. http://www.gisgraphy.com/
Yahoo: Placefinder. https://developer.yahoo.com/boss/geo/
Gisgraphy: Gisgraphy results comparison. http://www.gisgraphy.com/compare/
Deng, C., Ernsthausen, J.: Parsing addresses with usaddress (2014) Blog article. https://datamade.us/blog/parsing-addresses-with-usaddress
Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
Google Inc.: googlemaps 2.2 Python Client for Google Maps Services [21] (2015). https://pypi.python.org/pypi/googlemaps/
Yu, X.: pygeocoder 1.2.5. Python interface for Google Geocoding API [21] (2014). https://pypi.python.org/pypi/pygeocoder
Choi, S.C.T., Lin, Y.H.: Comparison of Public-Domain Software and Services for Probabilistic Record Linkage and Address Standardization. GitHub repository, https://github.com/schoi32/prl-splncs
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255. IEEE (2009)
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint (2016). arXiv:1603.04467
Parkhi, O.M., Vedaldi, A., Zisserman, A., et al.: Deep face recognition. In: BMVC, vol. 1, pp. 41.1–41.12. BMVA Press (2015)
Vedaldi, A., Lenc, K.: MatConvNet: Convolutional neural networks for Matlab. In: Proceedings of the 23rd ACM international conference on Multimedia, pp. 689–692. ACM (2015)
MapQuest: Geocoding API. https://developer.mapquest.com/products/geocoding
Microsoft Corporation: Bing Maps REST services. https://msdn.microsoft.com/en-us/library/ff701713.aspx
Philips, L.: The double metaphone search algorithm. C/C++ Users J. 18, 38–43 (2000)
NORC at the University of Chicago: Task 4, further PBS research report. Technical report, not published (2012)
DataMade: probablepeople. Python library (2014). https://github.com/datamade/probablepeople
Acknowledgements
We thank Kirk Wolter, Ned English, and Ilana Ventura for discussion. We also thank Katie Dekker for her expertise in sampling the USPS address database [24]. We are grateful for the feedback from Andreas Holzinger. The first author would like to thank the following people for interesting and inspiring discussions: Forest Gregg, Lulu Kang, Aleksandr Likhterman, Lek-Heng Lim, Dean Resnick, and the students enrolled in the research course SCI 498 / MATH 491 Computational Social Sciences, Illinois Institute of Technology, Summer 2016. In particular, Fabrício Soares deserves special thanks for figuring out the IP address of the geocoder.us server [22]. Last but not least, we appreciate the assistance of Jack Huang, University of Chicago, in verifying and enhancing our code for address standardization in Summer 2017.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Choi, SC.T., Lin, Y., Mulrow, E. (2017). Comparison of Public-Domain Software and Services For Probabilistic Record Linkage and Address Standardization. In: Holzinger, A., Goebel, R., Ferri, M., Palade, V. (eds) Towards Integrative Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, vol 10344. Springer, Cham. https://doi.org/10.1007/978-3-319-69775-8_3
DOI: https://doi.org/10.1007/978-3-319-69775-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69774-1
Online ISBN: 978-3-319-69775-8
eBook Packages: Computer Science, Computer Science (R0)