Abstract
Up to now, most campaign contribution data have been reported at the level of the donation. While donation-level data are interesting, one often needs information at the level of the donor. Obtaining information at that level is difficult, as there is neither a single repository of donations nor any standard across existing repositories. To mine campaign contribution data more meaningfully, political scientists need an accurate way of grouping, or linking, donations made by the same donor. In this paper, we describe a record linkage technique that is applicable to various sources and across large geographical areas. We show how it may be effectively applied to nationwide donation data and report new, previously unattainable results about campaign contributors in the 2007–2008 US election cycle.




Notes
The privacy concern may actually be quite prevalent as these same individuals (found in our linkage but who did not report giving to any candidates in the survey) are also more than twice as likely as others not to report their income.
Even when the “offending” individuals are not removed, FPR does not exceed 0.039 and precision does not go below 0.71 for any of the candidates.
Acknowledgments
Our thanks to Yao Huang, Weston Rowley, David Wilcox and David Lassen for research assistance and computer code. We are also grateful to David Magleby and Joseph Olson for their support, encouragement, and advice. Finally, we thank the anonymous reviewers for their very useful comments and suggestions.
Cite this article
Giraud-Carrier, C., Goodliffe, J., Jones, B.M. et al. Effective record linkage for mining campaign contribution data. Knowl Inf Syst 45, 389–416 (2015). https://doi.org/10.1007/s10115-014-0812-5
DOI: https://doi.org/10.1007/s10115-014-0812-5