DOI: 10.1145/2396761.2398554
CIKM Conference Proceedings · Short paper

Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries

Published: 29 October 2012

Abstract

Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities while focusing either on accuracy or on efficiency and speed, yet no perfect solution exists. We propose a combined, layered approach to duplicate detection whose main advantage is the use of crowdsourcing as a training and feedback mechanism. By applying active learning techniques to human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand, only for borderline cases or inconclusive assessments. We apply our simple yet powerful methods to an online publication search system: first, we perform coarse duplicate detection in real time, relying on publication signatures; a second, automatic step then compares duplicate candidates and increases accuracy while adjusting based on feedback from both our online users and crowdsourcing platforms. Our approach improves accuracy by 14% over the untrained setting and comes within 4% of human assessors.
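The abstract sketches a layered pipeline: real-time, signature-based coarse detection, then a finer pairwise comparison whose borderline or inconclusive pairs are sent to crowd workers, with the resulting labels fed back to tune the detector via active learning. As a rough illustration of that shape, here is a minimal Python sketch; the signature scheme, similarity features, thresholds, and the ask_crowd callback are assumptions made for this example, not the authors' actual implementation.

```python
# Minimal sketch of a two-stage deduplication pipeline with a crowd fallback.
# All details (signature scheme, features, thresholds, ask_crowd) are
# illustrative assumptions, not the paper's actual implementation.
from difflib import SequenceMatcher
from itertools import combinations


def signature(pub):
    """Coarse signature for blocking: normalized title prefix plus year."""
    title = "".join(ch for ch in pub["title"].lower() if ch.isalnum())
    return (title[:20], pub.get("year"))


def similarity(a, b):
    """Finer pairwise score from title similarity and author overlap."""
    t = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    authors_a, authors_b = set(a["authors"]), set(b["authors"])
    j = len(authors_a & authors_b) / max(len(authors_a | authors_b), 1)
    return 0.7 * t + 0.3 * j


def deduplicate(pubs, ask_crowd, low=0.6, high=0.9):
    """Stage 1: block records by signature. Stage 2: score candidate pairs.
    Pairs scoring between `low` and `high` are inconclusive and are routed
    to the crowd; the collected labels could later retune the thresholds."""
    blocks = {}
    for p in pubs:
        blocks.setdefault(signature(p), []).append(p)

    duplicates, crowd_labels = [], []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = similarity(a, b)
            if score >= high:                      # confident duplicate
                duplicates.append((a, b))
            elif score > low:                      # borderline: ask humans
                label = ask_crowd(a, b)
                crowd_labels.append((a, b, label))
                if label:
                    duplicates.append((a, b))
    return duplicates, crowd_labels


if __name__ == "__main__":
    pubs = [
        {"title": "Map to Humans and Reduce Error", "year": 2012,
         "authors": ["Doe", "Smith"]},
        {"title": "Map to humans and reduce error.", "year": 2012,
         "authors": ["Doe", "Smith", "Lee"]},
    ]
    dups, asked = deduplicate(pubs, ask_crowd=lambda a, b: True)
    print(f"{len(dups)} duplicate pair(s), {len(asked)} sent to the crowd")
```

In the paper's setting, the collected crowd labels would additionally drive active learning to retune the comparison step; the sketch only gathers them and leaves the retraining out.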

      Published In

      CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
      October 2012
      2840 pages
      ISBN:9781450311564
      DOI:10.1145/2396761

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. active learning
      2. crowdsourcing
      3. duplicate detection
      4. machine learning
      5. optimization

      Qualifiers

      • Short-paper

      Conference

      CIKM '12

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Article Metrics

      • Downloads (last 12 months): 2
      • Downloads (last 6 weeks): 0
      Reflects downloads up to 16 Feb 2025

      Cited By

      • (2020) End-to-End Learning from Noisy Crowd to Supervised Machine Learning Models. 2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI), pp. 17-26. DOI: 10.1109/CogMI50398.2020.00013. Online publication date: Oct 2020.
      • (2018) Reducing vertices in property graphs. PLOS ONE 13(2): e0191917. DOI: 10.1371/journal.pone.0191917. Online publication date: 14 Feb 2018.
      • (2018) Quality Control in Crowdsourcing. ACM Computing Surveys 51(1), pp. 1-40. DOI: 10.1145/3148148. Online publication date: 4 Jan 2018.
      • (2018) Machine learning from crowds: A systematic review of its applications. WIREs Data Mining and Knowledge Discovery 9(2). DOI: 10.1002/widm.1288. Online publication date: 16 Oct 2018.
      • (2014) Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment 7(12), pp. 1071-1082. DOI: 10.14778/2732977.2732982. Online publication date: 1 Aug 2014.
      • (2014) Aggregation of Crowdsourced Labels Based on Worker History. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), pp. 1-11. DOI: 10.1145/2611040.2611074. Online publication date: 2 Jun 2014.
      • (2014) When in Doubt Ask the Crowd. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), pp. 1-12. DOI: 10.1145/2611040.2611047. Online publication date: 2 Jun 2014.
