DOI: 10.1145/2396761.2398554
CIKM Conference Proceedings · Short paper

Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries

Published: 29 October 2012

Abstract

Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities while focusing either on accuracy or on efficiency and speed, yet no perfect solution exists. We propose a combined, layered approach to duplicate detection whose main advantage is the use of crowdsourcing as a training and feedback mechanism. By applying active learning techniques to human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand, only for borderline cases or inconclusive assessments. We apply our simple yet powerful methods to an online publication search system: first, we perform coarse duplicate detection in real time, relying on publication signatures; a second, automatic step then compares duplicate candidates and increases accuracy while adjusting based on feedback from both our online users and crowdsourcing platforms. Our approach improves accuracy by 14% over the untrained setting and comes within 4% of human assessors.
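The abstract sketches a layered pipeline: real-time, signature-based coarse detection, then a finer pairwise comparison whose borderline or inconclusive pairs are sent to crowd workers, with the resulting labels fed back to tune the detector via active learning. As a rough illustration of that shape, here is a minimal Python sketch; the signature scheme, similarity features, thresholds, and the ask_crowd callback are assumptions made for this example, not the authors' actual implementation.

```python
# Minimal sketch of a two-stage deduplication pipeline with a crowd fallback.
# All details (signature scheme, features, thresholds, ask_crowd) are
# illustrative assumptions, not the paper's actual implementation.
from difflib import SequenceMatcher
from itertools import combinations


def signature(pub):
    """Coarse signature for blocking: normalized title prefix plus year."""
    title = "".join(ch for ch in pub["title"].lower() if ch.isalnum())
    return (title[:20], pub.get("year"))


def similarity(a, b):
    """Finer pairwise score from title similarity and author overlap."""
    t = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    authors_a, authors_b = set(a["authors"]), set(b["authors"])
    j = len(authors_a & authors_b) / max(len(authors_a | authors_b), 1)
    return 0.7 * t + 0.3 * j


def deduplicate(pubs, ask_crowd, low=0.6, high=0.9):
    """Stage 1: block records by signature. Stage 2: score candidate pairs.
    Pairs scoring between `low` and `high` are inconclusive and are routed
    to the crowd; the collected labels could later retune the thresholds."""
    blocks = {}
    for p in pubs:
        blocks.setdefault(signature(p), []).append(p)

    duplicates, crowd_labels = [], []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = similarity(a, b)
            if score >= high:                      # confident duplicate
                duplicates.append((a, b))
            elif score > low:                      # borderline: ask humans
                label = ask_crowd(a, b)
                crowd_labels.append((a, b, label))
                if label:
                    duplicates.append((a, b))
    return duplicates, crowd_labels


if __name__ == "__main__":
    pubs = [
        {"title": "Map to Humans and Reduce Error", "year": 2012,
         "authors": ["Doe", "Smith"]},
        {"title": "Map to humans and reduce error.", "year": 2012,
         "authors": ["Doe", "Smith", "Lee"]},
    ]
    dups, asked = deduplicate(pubs, ask_crowd=lambda a, b: True)
    print(f"{len(dups)} duplicate pair(s), {len(asked)} sent to the crowd")
```

In the paper's setting, the collected crowd labels would additionally drive active learning to retune the comparison step; the sketch only gathers them and leaves the retraining out.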

      Published In

      CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
      October 2012
      2840 pages
      ISBN:9781450311564
      DOI:10.1145/2396761

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. active learning
      2. crowdsourcing
      3. duplicate detection
      4. machine learning
      5. optimization

      Qualifiers

      • Short-paper

      Conference

      CIKM '12

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Article Metrics

      • Downloads (last 12 months): 2
      • Downloads (last 6 weeks): 0
      Reflects downloads up to 16 Feb 2025

      Cited By

      • (2020) End-to-End Learning from Noisy Crowd to Supervised Machine Learning Models. 2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI), pp. 17-26. DOI: 10.1109/CogMI50398.2020.00013. Online publication date: Oct 2020.
      • (2018) Reducing vertices in property graphs. PLOS ONE 13(2): e0191917. DOI: 10.1371/journal.pone.0191917. Online publication date: 14 Feb 2018.
      • (2018) Quality Control in Crowdsourcing. ACM Computing Surveys 51(1), pp. 1-40. DOI: 10.1145/3148148. Online publication date: 4 Jan 2018.
      • (2018) Machine learning from crowds: A systematic review of its applications. WIREs Data Mining and Knowledge Discovery 9(2). DOI: 10.1002/widm.1288. Online publication date: 16 Oct 2018.
      • (2014) Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment 7(12), pp. 1071-1082. DOI: 10.14778/2732977.2732982. Online publication date: 1 Aug 2014.
      • (2014) Aggregation of Crowdsourced Labels Based on Worker History. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), pp. 1-11. DOI: 10.1145/2611040.2611074. Online publication date: 2 Jun 2014.
      • (2014) When in Doubt Ask the Crowd. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), pp. 1-12. DOI: 10.1145/2611040.2611047. Online publication date: 2 Jun 2014.
