DOI: 10.1145/2795218.2795222
research-article

CrowdLink: An Error-Tolerant Model for Linking Complex Records

Published: 31 May 2015

Abstract

Record linkage (RL) is the task of finding records that refer to the same entity across different data sources (e.g., data files, books, websites, databases), and it is a long-standing challenge in database management. Algorithmic approaches have been proposed to improve RL quality, but they remain far from perfect. Crowdsourcing brings human insight into the process and is more accurate, but it is also expensive and slow. In this paper, we propose a new probabilistic model, CrowdLink, to address these limitations. In particular, our model gracefully handles crowd errors and the correlation among different record pairs, and it lets us decompose records into small pieces (i.e., attributes) that crowd workers can easily verify. Further, we develop efficient and effective algorithms that select the most valuable questions in order to reduce the monetary cost of crowdsourcing. We conducted extensive experiments on both synthetic and real-world datasets. The experimental results verify the effectiveness and applicability of our model.
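
To make the idea of tolerating crowd error concrete, the following is a minimal illustrative sketch and not the CrowdLink model itself, whose details are not given in this abstract. It assumes each worker answers a yes/no "do these attribute values match?" question independently and errs with a known probability; the function name `posterior_match` and the parameters `prior` and `error_rate` are hypothetical. Under those assumptions, Bayes' rule turns a set of noisy answers into a posterior match probability.

```python
from typing import Sequence


def posterior_match(prior: float, answers: Sequence[bool], error_rate: float) -> float:
    """Posterior probability that a record pair matches.

    Assumptions (illustrative only, not taken from the paper):
    - `prior` is the match probability before asking the crowd, in (0, 1).
    - Each answer is True ("match") or False ("no match").
    - Workers err independently with the same probability `error_rate` < 0.5.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must be strictly between 0 and 0.5")

    likelihood_match = 1.0      # P(answers | pair matches)
    likelihood_nonmatch = 1.0   # P(answers | pair does not match)
    for says_match in answers:
        if says_match:
            likelihood_match *= 1.0 - error_rate
            likelihood_nonmatch *= error_rate
        else:
            likelihood_match *= error_rate
            likelihood_nonmatch *= 1.0 - error_rate

    numerator = prior * likelihood_match
    return numerator / (numerator + (1.0 - prior) * likelihood_nonmatch)


if __name__ == "__main__":
    # One "match" vote raises the belief; a conflicting vote cancels it out.
    print(posterior_match(0.5, [True], 0.2))
    print(posterior_match(0.5, [True, False], 0.2))
```

In this simplified setting a single answer never settles a pair outright, which is why deciding which question to ask next, under a budget, becomes a meaningful optimization problem.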



    Published In

    ExploreDB '15: Proceedings of the Second International Workshop on Exploratory Search in Databases and the Web
    May 2015
    37 pages
    ISBN:9781450337403
    DOI:10.1145/2795218

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Melbourne, VIC, Australia

    Acceptance Rates

    ExploreDB '15 Paper Acceptance Rate 6 of 10 submissions, 60%;
    Overall Acceptance Rate 11 of 21 submissions, 52%


    Cited By

    • (2021) The Four Generations of Entity Resolution. Synthesis Lectures on Data Management 16:2 (1-170). DOI: 10.2200/S01067ED1V01Y202012DTM064. Online publication date: 15-Mar-2021
    • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys 53:6 (1-42). DOI: 10.1145/3418896. Online publication date: 6-Dec-2020
    • (2017) A Method for Entity Resolution in High Dimensional Data Using Ensemble Classifiers. Mathematical Problems in Engineering 2017:1. DOI: 10.1155/2017/4953280. Online publication date: 15-Feb-2017
    • (2017) Waldo. Proceedings of the 2017 ACM International Conference on Management of Data (1133-1148). DOI: 10.1145/3035918.3035931. Online publication date: 9-May-2017
