Abstract
Entity identification is the task of matching records, either across different datasets or within a single dataset, that represent the same real-world entity when unique identifiers are not available. Because it enables data integration at the record level as well as the detection of duplicates, entity identification plays a major role in data preprocessing, especially with regard to data quality. This paper presents a framework for statistical entity identification, focusing in particular on probabilistic record linkage and string matching, together with its implementation in R. Following the stages of the entity identification process, the framework is structured into seven core components: data preparation, candidate selection, comparison, scoring, classification, decision, and evaluation. Samples of real-world CRM datasets serve as illustrative examples.
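To make the seven stages concrete, the following is a minimal sketch of how such a pipeline could be chained together. It is written in Python rather than R and is not the paper's implementation: the toy records, the m- and u-probabilities, the thresholds, the blocking key, and every function name are hypothetical choices made for illustration. Reading the decision stage as "accept the pairs classified as links" is likewise an assumption. The scoring step uses log-likelihood-ratio agreement weights in the style of probabilistic record linkage, and the string comparison uses the standard library's `difflib.SequenceMatcher` as a stand-in for a dedicated string-matching method.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the paper's CRM samples are not reproduced here.
RECORDS = [
    {"id": 1, "name": "Anna Maier ", "city": "Vienna"},
    {"id": 2, "name": "anna mayer", "city": "Vienna"},
    {"id": 3, "name": "Peter Huber", "city": "Graz"},
    {"id": 4, "name": "Anton Gruber", "city": "Vienna"},
]

# Assumed (m, u) probabilities per field: illustrative values, not estimates.
M_U = {"name": (0.95, 0.05), "city": (0.90, 0.30)}


def prepare(record):
    """Data preparation: trim and lower-case all string fields."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}


def select_candidates(records):
    """Candidate selection: block on city, so only same-city pairs are compared."""
    return [(a, b) for a, b in combinations(records, 2) if a["city"] == b["city"]]


def compare(a, b, threshold=0.8):
    """Comparison: per-field agreement; names agree if similarity >= threshold."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return {"name": name_sim >= threshold, "city": a["city"] == b["city"]}


def score(agreement):
    """Scoring: sum of log2 agreement/disagreement weights over all fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=3.0):
    """Classification: two thresholds yield link, possible link, or non-link."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


def decide(labels):
    """Decision (assumed reading): accept only the pairs classified as links."""
    return {pair for pair, label in labels.items() if label == "link"}


def evaluate(accepted, true_matches):
    """Evaluation: precision and recall against a known set of true matches."""
    hits = len(accepted & true_matches)
    precision = hits / len(accepted) if accepted else 0.0
    recall = hits / len(true_matches) if true_matches else 0.0
    return precision, recall


prepared = [prepare(r) for r in RECORDS]
labels = {(a["id"], b["id"]): classify(score(compare(a, b)))
          for a, b in select_candidates(prepared)}
accepted = decide(labels)
TRUE_MATCHES = {(1, 2)}  # hypothetical gold standard
precision, recall = evaluate(accepted, TRUE_MATCHES)
print(labels, precision, recall)
```

In this sketch, blocking on city means record 3 is never compared with the others, and pairs whose weight falls between the two thresholds would be left as possible links for review rather than accepted automatically.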
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Denk, M. (2008). A Framework for Statistical Entity Identification in R. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_40
DOI: https://doi.org/10.1007/978-3-540-78246-9_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics (R0)